[Yahoo-eng-team] [Bug 2020028] [NEW] evacuate an instance on non-shared storage succeeded and boot image is rebuilt
Public bug reported:

Description
===========
Evacuating an instance on non-shared storage succeeded, and the boot image was rebuilt.

Steps to reproduce
==================
1. Create a two-compute-node cluster without shared storage
2. Boot an image-backed virtual machine
3. Shut down the compute node where the VM is running
4. Evacuate the instance to another node

Expected result: evacuation fails.
Actual result: evacuation succeeded and the boot image was rebuilt.

Version
=======
Using nova Victoria.

** Affects: nova
     Importance: Undecided
         Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2020028

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2020028/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1999126] [NEW] resize to the same host unexpectedly clears host_info cache
Public bug reported:

Description
===========
We are using Victoria nova and find that after a same-host cold migration, a subsequent cold migration can break the anti-affinity policy.

Steps to reproduce
==================
1. Provision an OpenStack cluster with 2 compute nodes
2. Create a server group with an anti-affinity rule
3. Create two VMs bound to the server group
4. Disable compute node B
5. Cold migrate the server on node A and confirm it
6. Right after the confirm succeeds, migrate the server on node B
7. Check the VM hosts; if they are not on the same host, repeat steps 5-6

Expected result
===============
The servers will reside on different nodes.

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1999126
[Yahoo-eng-team] [Bug 1996966] [NEW] get_machine_ips took too long to complete
Public bug reported:

Description
===========
I found that get_machine_ips can take too long before returning IP addresses. There are around 160 instances with about 200 NICs, which results in around 1000 network adapters on the host. Each call to netifaces.ifaddresses takes approximately 0.2 ~ 0.5 seconds.

Steps to reproduce
==================
1. Use an arm64 host, or a host under high load
2. Boot 200 instances with the neutron hybrid SG driver enabled
3. Restart nova-compute

Expected result
===============
get_machine_ips should take no more than 2 seconds to return.

Actual result
=============
It took around 500 seconds.

Environment
===========
1. Phytium arm64 as compute node

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1996966
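The report's own estimates already account for the observed delay; a back-of-the-envelope sketch (illustrative arithmetic only, not nova code; the numbers come from the report, not from measuring this snippet):

```python
# Illustrative arithmetic using the estimates from this report: ~1000
# network adapters on the host, and each netifaces.ifaddresses() call
# taking roughly 0.2-0.5 s on a loaded arm64 box. A serial loop over
# all adapters therefore lands in the 200-500 s range, which matches
# the ~500 s actually observed.
per_call_seconds = 0.5      # upper-end estimate from the report
adapters = 1000
total_seconds = per_call_seconds * adapters
print(total_seconds)  # → 500.0
```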
[Yahoo-eng-team] [Bug 1995229] [NEW] [Opinion] Update instance availability_zone when reset host AZ
Public bug reported:

Description
===========
Instance.availability_zone is set in nova.conductor during scheduling. A host's availability zone can later be modified when the host is added to an aggregate, but instance.availability_zone is not changed; instead, 'availability_zone' is cached in applications like memcached. The problem with this strategy is that /servers/detail costs around an extra 1 second when the returned list contains more than 500 servers. So my proposal is to update instance.availability_zone when the host is added to a new aggregate.

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1995229
[Yahoo-eng-team] [Bug 1995028] [NEW] list os-service causing reconnects to memcached all the time
Public bug reported:

Description
===========
We are running a Victoria OpenStack cluster (Python 3), and I observe that every time an `openstack compute service list` is executed, nova-api creates a new connection to memcached. There are several causes of this behavior:

1. When running natively with eventlet's WSGI server, a new coroutine is created to host each web request, and this causes keystonemiddleware's auth_token (which uses python-memcached) to reconnect to memcached every time.
2. os-services triggers nova.availability_zones.set_availability_zones, which updates the cache every time; since cell v2 is enabled, this method also runs in a coroutine.
3. python-memcached's Client inherits from threading.local, which is monkey-patched to use eventlet's implementation, so every coroutine context creates a new connection.

Steps to reproduce
==================
1. Patch `def _get_socket` and print the connection
2. Execute `openstack compute service list`

Expected result
===============
Stable connections to memcached are maintained.

Actual result
=============
Reconnects.

Environment
===========
1. devstack Victoria OpenStack

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1995028
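Point 3 above can be demonstrated with plain stdlib threads (a minimal sketch: `FakeClient` is a hypothetical stand-in for python-memcached's `Client`, and ordinary threads stand in for eventlet greenthreads, which behave the same way once `threading.local` is monkey-patched to be per-greenthread):

```python
import threading

class FakeClient:
    """Hypothetical stand-in for python-memcached's Client (not the real
    library): the real Client inherits from threading.local, so any
    cached socket lives in per-thread (per-coroutine) storage."""
    def __init__(self):
        self._local = threading.local()

    def get_socket(self):
        # Each thread -- or, under eventlet monkey patching, each
        # greenthread -- starts with an empty local namespace, so it
        # opens its own "connection" even on a shared client instance.
        if not hasattr(self._local, "sock"):
            self._local.sock = object()  # stands in for a real socket
        return self._local.sock

client = FakeClient()
sockets = []
lock = threading.Lock()

def handle_request():
    sock = client.get_socket()
    with lock:
        sockets.append(sock)

threads = [threading.Thread(target=handle_request) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# One client object, but three distinct "connections".
print(len({id(s) for s in sockets}))  # → 3
```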
[Yahoo-eng-team] [Bug 1995029] [NEW] list os-service causing reconnects to memcached all the time
Public bug reported:

Description
===========
We are running a Victoria OpenStack cluster (Python 3), and I observe that every time an `openstack compute service list` is executed, nova-api creates a new connection to memcached. There are several causes of this behavior:

1. When running natively with eventlet's WSGI server, a new coroutine is created to host each web request, and this causes keystonemiddleware's auth_token (which uses python-memcached) to reconnect to memcached every time.
2. os-services triggers nova.availability_zones.set_availability_zones, which updates the cache every time; since cell v2 is enabled, this method also runs in a coroutine.
3. python-memcached's Client inherits from threading.local, which is monkey-patched to use eventlet's implementation, so every coroutine context creates a new connection.

Steps to reproduce
==================
1. Patch `def _get_socket` and print the connection
2. Execute `openstack compute service list`

Expected result
===============
Stable connections to memcached are maintained.

Actual result
=============
Reconnects.

Environment
===========
1. devstack Victoria OpenStack

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1995029
[Yahoo-eng-team] [Bug 1991380] [NEW] centos 7.6 cannot access 169.254.169.254
Public bug reported:

Hello, I am testing CentOS 7.6 on a Victoria OpenStack. In the virtual machine, the routing table looks like this:

# ip r
default via 172.31.0.1 dev eth0
192.168.0.0/16 dev eth1 proto kernel scope link src 192.168.0.9
169.254.0.0/16 dev eth0 scope link metric 1002
169.254.0.0/16 dev eth1 scope link metric 1003

As shown, the 169.254.0.0/16 routes seem to override the route to 169.254.169.254, causing the VM to fail to access the metadata service. Any idea why this happens? Thank you.

** Affects: cloud-init
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1991380
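For reference, the "override" follows ordinary longest-prefix-match routing: the on-link /16 routes are more specific than the default route, so packets for the metadata IP are kept on-link instead of going via the gateway. A small sketch (illustrative only, not cloud-init code):

```python
import ipaddress

# Longest-prefix match: the kernel picks the matching route with the
# longest prefix. Here 169.254.169.254 matches both the default route
# (/0) and the on-link 169.254.0.0/16 routes, so the /16 wins and the
# packet is sent on-link rather than via 172.31.0.1.
routes = {
    "default via 172.31.0.1": ipaddress.ip_network("0.0.0.0/0"),
    "169.254.0.0/16 on-link": ipaddress.ip_network("169.254.0.0/16"),
}
dst = ipaddress.ip_address("169.254.169.254")
best = max((name for name, net in routes.items() if dst in net),
           key=lambda name: routes[name].prefixlen)
print(best)  # → 169.254.0.0/16 on-link
```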
[Yahoo-eng-team] [Bug 1988281] [NEW] neutron dhcp agent state not consistent with real status
Public bug reported:

We are observing that neutron-dhcp-agent's state deviates from its "real state"; by real state, I mean that all hosted dnsmasq processes are running and configured.

For example, if agent A is hosting 1,000 networks and I reboot agent A, then all dnsmasq processes are gone and the DHCP agent will try to restart every dnsmasq. This introduces a long delay between the agent starting and the agent handling new RabbitMQ messages. But strangely, `openstack network agent list` shows that the agent is up and running, which IMO is inconsistent. I think in this situation `openstack network agent list` should report the corresponding agent as down.

** Affects: neutron
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1988281
[Yahoo-eng-team] [Bug 1982902] [NEW] umount /run/cloud-init/tmp/tmpl5n7csdd failed
Public bug reported:

Hello, I am using cloud-init version /usr/bin/cloud-init 20.4.1-0ubuntu1~18.04.1. The Ubuntu version is:

root@ubuntu:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:        18.04
Codename:       bionic

I found that unmounting the config drive fails with "device busy", which further causes the temp folder to fail to be deleted. Logs:

```
2022-07-25 02:13:01,732 - handlers.py[DEBUG]: finish: init-local/search-ConfigDrive: FAIL: no local data found from DataSourceConfigDrive
2022-07-25 02:13:01,733 - util.py[WARNING]: Getting data from failed
2022-07-25 02:13:01,733 - util.py[DEBUG]: Getting data from failed
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/temp_utils.py", line 90, in tempdir
    yield tdir
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1687, in mount_cb
    return ret
  File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1571, in unmounter
    subp.subp(umount_cmd)
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 295, in subp
    cmd=args)
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['umount', '/run/cloud-init/tmp/tmpl5n7csdd']
Exit code: 32
Reason: -
Stdout:
Stderr: umount: /run/cloud-init/tmp/tmpl5n7csdd: target is busy.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 771, in find_source
    if s.update_metadata([EventType.BOOT_NEW_INSTANCE]):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 660, in update_metadata
    result = self.get_data()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/__init__.py", line 279, in get_data
    return_value = self._get_data()
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceConfigDrive.py", line 81, in _get_data
    mtype=mtype)
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1687, in mount_cb
    return ret
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python3/dist-packages/cloudinit/temp_utils.py", line 92, in tempdir
    shutil.rmtree(tdir, ignore_errors=rmtree_ignore_errors)
  File "/usr/lib/python3.6/shutil.py", line 486, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib/python3.6/shutil.py", line 424, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.6/shutil.py", line 424, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib/python3.6/shutil.py", line 444, in _rmtree_safe_fd
    onerror(os.unlink, fullname, sys.exc_info())
  File "/usr/lib/python3.6/shutil.py", line 442, in _rmtree_safe_fd
    os.unlink(name, dir_fd=topfd)
OSError: [Errno 30] Read-only file system: 'network_data.json'
2022-07-25 02:13:01,783 - main.py[DEBUG]: No local datasource found
```

** Affects: cloud-init
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1982902
[Yahoo-eng-team] [Bug 1978827] [NEW] rebuild instance continues to flush old mpath on failure
Public bug reported:

Description
===========
When rebuilding an instance fails due to a potentially problematic cinder API, nova will, on the next rebuild attempt, try to disconnect the volume again even though the path has already been cleared. This is generally OK for the rbd backend, but it can cause problems for an FC SAN if a new volume gets attached between the two consecutive rebuilds and consumes the previous LUN ID.

Environment
===========
It can happen on all versions.

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1978827
[Yahoo-eng-team] [Bug 1973656] [NEW] meaning of option "router_auto_schedule" is ambiguous
Public bug reported:

I find the meaning of the option "router_auto_schedule" hard to follow. A quick code review finds it is only used at (tests excluded):

```python
    def get_router_ids(self, context, host):
        """Returns IDs of routers scheduled to l3 agent on <host>

        This will autoschedule unhosted routers to l3 agent on <host>
        and then return all ids of routers scheduled to it.
        """
        if extensions.is_extension_supported(
                self.l3plugin, constants.L3_AGENT_SCHEDULER_EXT_ALIAS):
            if cfg.CONF.router_auto_schedule:
                self.l3plugin.auto_schedule_routers(context, host)
        return self.l3plugin.list_router_ids_on_host(context, host)
```

which seems to exist to fix routers that have no agents associated with them. And even if I turn this option off, routers are still properly scheduled to agents, because

```python
    @registry.receives(resources.ROUTER, [events.AFTER_CREATE],
                       priority_group.PRIORITY_ROUTER_EXTENDED_ATTRIBUTE)
    def _after_router_create(self, resource, event, trigger, context,
                             router_id, router, router_db, **kwargs):
        if not router['ha']:
            return
        try:
            self.schedule_router(context, router_id)
            router['ha_vr_id'] = router_db.extra_attributes.ha_vr_id
            self._notify_router_updated(context, router_id)
        except Exception as e:
            with excutils.save_and_reraise_exception() as ctx:
                if isinstance(e, l3ha_exc.NoVRIDAvailable):
                    ctx.reraise = False
                    LOG.warning("No more VRIDs for router: %s", e)
                else:
                    LOG.exception("Failed to schedule HA router %s.",
                                  router_id)
                router['status'] = self._update_router_db(
                    context, router_id,
                    {'status': constants.ERROR})['status']
```

seems not to respect this option. So IMO router_auto_schedule might better be renamed to something like `fix_dangling_routers`, and could be turned off if the user wants to fix wrongly scheduled routers manually. The reason is that auto-scheduling routers per agent is pretty expensive for a relatively large deployment with around 10,000 routers.

** Affects: neutron
     Importance: Undecided
         Status: New
[Yahoo-eng-team] [Bug 1973576] [NEW] remove eager subquery load for DistributedPortBinding
Public bug reported:

We observe excessive DB calls to load DistributedPortBindings. We have DVR enabled and have some huge virtual routers with around 60 router interfaces, scheduled on around 200 compute nodes. We saw something like:

```console
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/sqlalchemy/engine/base.py", line 1193, in _execute_context
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     context)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/sqlalchemy/engine/default.py", line 509, in do_execute
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     cursor.execute(statement, parameters)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/cursors.py", line 170, in execute
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     result = self._query(query)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/cursors.py", line 328, in _query
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     conn.query(q)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 516, in query
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     self._affected_rows = self._read_query_result(unbuffered=unbuffered)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 727, in _read_query_result
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     result.read()
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 1073, in read
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     self._read_result_packet(first_packet)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 1143, in _read_result_packet
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     self._read_rowdata_packet()
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 1177, in _read_rowdata_packet
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     packet = self.connection._read_packet()
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 673, in _read_packet
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     recv_data = self._read_bytes(bytes_to_read)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/pymysql/connections.py", line 702, in _read_bytes
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server The above exception was the direct cause of the following exception:
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 104, in get_router_ids
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server     self.l3plugin.auto_schedule_routers(context, host)
2022-05-12 05:59:06.406 50 ERROR oslo_messaging.rpc.server   File
```
[Yahoo-eng-team] [Bug 1968837] [NEW] too many l3 dvr agents got notifications after a server got deleted
Public bug reported:

We are using Rocky (13.0.6) neutron, which seems to remove the router namespace if the retry limit is hit. After some investigation, it seems that deleting a server which already has an associated floating IP address causes a broadcast notification to all related routers. In our case, we have around 300 compute nodes, and they all have L3 DVR agents running. The related code snippet is https://github.com/openstack/neutron/blob/bb4c26eb7245465bf7cea7e0f07342601eb78ede/neutron/db/l3_db.py#L1999, so my question is: is it still relevant to have it if DVR is enabled?

** Affects: neutron
   Importance: Undecided
       Status: New

https://bugs.launchpad.net/bugs/1968837
[Yahoo-eng-team] [Bug 1964587] Re: default video driver
** Changed in: nova
       Status: Incomplete => Invalid

https://bugs.launchpad.net/bugs/1964587

Title: default video driver

Status in OpenStack Compute (nova): Invalid

Bug description:

  Hello, I saw that on the amd64 platform nova defaults to cirrus as the video driver, and a Windows virtual machine gets a small resolution. The virtio video driver could allow a larger resolution, and it looks like the driver type cannot be set by the user. My question is: why use cirrus as the default, and do we have a plan to adopt virtio? Thank you.
[Yahoo-eng-team] [Bug 1964587] [NEW] default video driver
Public bug reported:

Hello, I saw that on the amd64 platform nova defaults to cirrus as the video driver, and a Windows virtual machine gets a small resolution. The virtio video driver could allow a larger resolution, and it looks like the driver type cannot be set by the user. My question is: why use cirrus as the default, and do we have a plan to adopt virtio? Thank you.

** Affects: nova
   Importance: Undecided
       Status: Incomplete

** Changed in: nova
       Status: New => Incomplete

https://bugs.launchpad.net/bugs/1964587
[Yahoo-eng-team] [Bug 1954619] Re: device_name is too narrow
Thank you, I saw that a patch has been merged upstream for new releases, so this should be fixed.

** Changed in: horizon
       Status: New => Invalid

https://bugs.launchpad.net/bugs/1954619

Title: device_name is too narrow

Status in OpenStack Dashboard (Horizon): Invalid

Bug description:

  Horizon will auto-fill a device_name of "vda" by default, but "vda" only makes sense for a virtio-blk block device; for a SCSI device, "sda" makes more sense. Nova will take care of the device name if it is not specified, so why not make this field null by default and let nova choose a better device_name instead?
[Yahoo-eng-team] [Bug 1954619] [NEW] device_name is too narrow
Public bug reported:

Horizon will auto-fill a device_name of "vda" by default, but "vda" only makes sense for a virtio-blk block device; for a SCSI device, "sda" makes more sense. Nova will take care of the device name if it is not specified, so why not make this field null by default and let nova choose a better device_name instead?

** Affects: horizon
   Importance: Undecided
       Status: New

https://bugs.launchpad.net/bugs/1954619
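The suggestion above can be sketched in a few lines: only send device_name to nova when the user explicitly typed one, and otherwise omit the field so nova picks a name appropriate for the bus (virtio -> vdX, SCSI -> sdX). The helper below is illustrative, not Horizon's actual form code.

```python
# Sketch: build a block_device_mapping_v2 entry without a hard-coded "vda".
def build_bdm(volume_id, device_name=None):
    bdm = {
        "uuid": volume_id,
        "source_type": "volume",
        "destination_type": "volume",
        "boot_index": 0,
    }
    # Only include device_name when the user explicitly provided one;
    # otherwise nova assigns a name suited to the disk bus.
    if device_name:
        bdm["device_name"] = device_name
    return bdm
```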
[Yahoo-eng-team] [Bug 1953718] [NEW] nova compute failed to update placement if mdev max available is 0
Public bug reported:

Description
===========
nova-compute will fail to update vGPU mdev placement data if the mdev type is changed while there are previously created mdev devices of a different type. For NVIDIA, under such circumstances the max available instances will be 0.

Steps to reproduce
==================
1. Configure the vGPU type to nvidia-231 and boot one instance.
2. Change the vGPU type to nvidia-233 and restart the nova-compute service.
3. nova-compute will then fail to update placement.

Expected result
===============
Better observability, for example refusing to start the nova-compute service, or better logging to help the operator understand the possible cause.

Actual result
=============
```console
2021-12-09 07:18:13.774 632001 ERROR nova.scheduler.client.report [None req-d717a248-4d90-4262-bf8b-11875c60aea6 - - - - -] [req-03944f1d-79bb-4d2f-b37a-99db24d78653] Failed to update inventory to [{'VGPU': {'total': 0, 'min_unit': 1, 'step_size': 1, 'reserved': 0, 'allocation_ratio': 1.0, 'max_unit': 0}}] for resource provider with UUID 9b6dd7c7-50c8-4780-b343-4c2e65dd0c67. Got 400: {"errors": [{"status": 400, "title": "Bad Request", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n JSON does not validate: 0 is less than the minimum of 1 Failed validating 'minimum' in schema['properties']['inventories']['patternProperties']['^[A-Z0-9_]+$']['properties']['total']: {'maximum': 2147483647, 'minimum': 1, 'type': 'integer'} On instance['inventories']['VGPU']['total']: 0 ", "code": "placement.undefined_code", "request_id": "req-03944f1d-79bb-4d2f-b37a-99db24d78653"}]}
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager [None req-d717a248-4d90-4262-bf8b-11875c60aea6 - - - - -] Error updating resources for node compute-009.: nova.exception.ResourceProviderSyncFailed: Failed to synchronize the placement service with resource provider information supplied by the compute host.
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager Traceback (most recent call last):
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/scheduler/client/report.py", line 1342, in catch_all
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager     yield
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/scheduler/client/report.py", line 1430, in update_from_provider_tree
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager     self.set_inventory_for_provider(
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/scheduler/client/report.py", line 951, in set_inventory_for_provider
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager     raise exception.ResourceProviderUpdateFailed(url=url, error=resp.text)
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager nova.exception.ResourceProviderUpdateFailed: Failed to update resource provider via URL /resource_providers/9b6dd7c7-50c8-4780-b343-4c2e65dd0c67/inventories: {"errors": [{"status": 400, "title": "Bad Request", "detail": "The server could not comply with the request since it is either malformed or otherwise incorrect.\n\n JSON does not validate: 0 is less than the minimum of 1 Failed validating 'minimum' in schema['properties']['inventories']['patternProperties']['^[A-Z0-9_]+$']['properties']['total']: {'maximum': 2147483647, 'minimum': 1, 'type': 'integer'} On instance['inventories']['VGPU']['total']: 0 ", "code": "placement.undefined_code", "request_id": "req-03944f1d-79bb-4d2f-b37a-99db24d78653"}]}
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager During handling of the above exception, another exception occurred:
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager Traceback (most recent call last):
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 10293, in _update_available_resource_for_node
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager     self.rt.update_available_resource(context, nodename,
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/resource_tracker.py", line 910, in update_available_resource
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager     self._update_available_resource(context, resources, startup=startup)
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File "/var/lib/openstack/lib/python3.8/site-packages/oslo_concurrency/lockutils.py", line 360, in inner
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager     return f(*args, **kwargs)
2021-12-09 07:18:13.775 632001 ERROR nova.compute.manager   File
```
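As the 400 response shows, placement's inventory schema requires `total >= 1`, so an inventory record with zero capacity can never be accepted. A minimal sketch (not nova's actual code) of the kind of guard the report asks for: drop zero-capacity resource classes from the update and surface them to the operator instead of triggering the 400.

```python
# Placement's inventory JSON schema sets 'minimum': 1 on 'total'.
PLACEMENT_MIN_TOTAL = 1

def sanitize_inventories(inventories):
    """Split inventories into schema-valid ones and ones that would be
    rejected by placement (total below the schema minimum)."""
    sane, dropped = {}, []
    for rc, inv in inventories.items():
        if inv.get("total", 0) < PLACEMENT_MIN_TOTAL:
            dropped.append(rc)  # caller should log these loudly
        else:
            sane[rc] = inv
    return sane, dropped

# The exact inventory from the log above: VGPU with total == 0.
inv = {"VGPU": {"total": 0, "min_unit": 1, "step_size": 1,
                "reserved": 0, "allocation_ratio": 1.0, "max_unit": 0}}
sane, dropped = sanitize_inventories(inv)
# dropped now names VGPU, so the operator sees why no VGPU capacity is reported.
```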
[Yahoo-eng-team] [Bug 1946546] [NEW] nova-compute endlessly waits for snapshot completes
Public bug reported:

Description
===========
When trying to create a server image, nova-compute will wait endlessly for the snapshot to be created. This is quite dangerous because the server's file system has already been frozen and IO operations have been disabled.

** Affects: nova
   Importance: Undecided
       Status: New

https://bugs.launchpad.net/bugs/1946546
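The danger described above is the unbounded wait while the guest file system is frozen. A minimal sketch of bounding such a wait with a deadline, so the caller gets a chance to thaw the guest instead of blocking forever; the timeout value and the `SnapshotTimeout` exception are illustrative, not nova's actual API.

```python
import time

class SnapshotTimeout(Exception):
    """Hypothetical error raised when a snapshot does not finish in time."""

def wait_for_snapshot(is_complete, timeout=300.0, interval=0.01,
                      clock=time.monotonic):
    """Poll is_complete() until it returns True or the deadline passes."""
    deadline = clock() + timeout
    while not is_complete():
        if clock() > deadline:
            # The caller should thaw the guest file system before re-raising,
            # instead of leaving IO disabled indefinitely.
            raise SnapshotTimeout("snapshot did not complete in %.0fs" % timeout)
        time.sleep(interval)
```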
[Yahoo-eng-team] [Bug 1940641] [NEW] nova compute with allocated vgpu device failed to start after host reboot
Public bug reported:

Description
===========
The nova-compute service fails to start after a host reboot if there were vGPU virtual machines beforehand.

Error log
=========
```console
2021-08-20 09:37:30.331 284159 DEBUG nova.virt.libvirt.volume.mount [None req-6ad4e06c-980e-4759-8b36-6c696e596dab - - - - -] Initialising _HostMountState generation 0 host_up /var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/volume/mount.py:131
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service [-] Error starting thread.: libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_74527849_d08c_4243_b868_f84a1437c9b5'
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service Traceback (most recent call last):
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/oslo_service/service.py", line 807, in run_service
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     service.start()
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/nova/service.py", line 159, in start
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     self.manager.init_host()
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 1414, in init_host
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     self.driver.init_host(host=self.host)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 733, in init_host
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     self._recreate_assigned_mediated_devices()
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 862, in _recreate_assigned_mediated_devices
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     dev_info = self._get_mediated_device_information(dev_name)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", line 7380, in _get_mediated_device_information
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     virtdev = self._host.device_lookup_by_name(devname)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/host.py", line 1153, in device_lookup_by_name
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     return self.get_connection().nodeDeviceLookupByName(name)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/eventlet/tpool.py", line 190, in doit
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     result = proxy_call(self._autowrap, f, *args, **kwargs)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/eventlet/tpool.py", line 148, in proxy_call
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     rv = execute(f, *args, **kwargs)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/eventlet/tpool.py", line 129, in execute
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     six.reraise(c, e, tb)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/six.py", line 703, in reraise
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     raise value
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/eventlet/tpool.py", line 83, in tworker
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     rv = meth(*args, **kwargs)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service   File "/var/lib/openstack/lib/python3.8/site-packages/libvirt.py", line 4614, in nodeDeviceLookupByName
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service     if ret is None:raise libvirtError('virNodeDeviceLookupByName() failed', conn=self)
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service libvirt.libvirtError: Node device not found: no node device with matching name 'mdev_74527849_d08c_4243_b868_f84a1437c9b5'
2021-08-20 09:37:30.421 284159 ERROR oslo_service.service
```

Environment
===========
nova: victoria
os: ubuntu 20.04

Steps to reproduce
==================
Create vGPU virtual machines (mdev) and then reboot the host.

** Affects: nova
   Importance: Undecided
       Status: New

https://bugs.launchpad.net/bugs/1940641

Title: nova compute with allocated vgpu device failed to start after host reboot

Status in OpenStack Compute (nova):
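Mediated devices are not persistent across a host reboot, which is why libvirt reports the node device as missing. The name in the log is simply the mdev UUID with dashes replaced by underscores and an `mdev_` prefix; a small sketch of that mapping (it mirrors the naming visible in the traceback, not nova's exact helper):

```python
def libvirt_mdev_name(uuid_str):
    """Map an mdev UUID to the libvirt node device name seen in the log."""
    return "mdev_" + uuid_str.replace("-", "_")

def mdev_uuid_from_name(dev_name):
    """Inverse mapping: recover the mdev UUID from the node device name."""
    return dev_name[len("mdev_"):].replace("_", "-")

name = libvirt_mdev_name("74527849-d08c-4243-b868-f84a1437c9b5")
# This is exactly the device libvirt cannot find after the reboot, because
# the mdev was never recreated under /sys before nova-compute started.
```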
[Yahoo-eng-team] [Bug 1940012] [NEW] allow attaching pci devices as different functions
Public bug reported:

Description
===========
We have a use case to attach an FPGA device to a virtual machine. This FPGA card has two functions, and we can attach both of them using an alias. After both of them are passed through to the virtual machine, we found that they do not appear as different functions of the same PCI device. Instead, they show up as two separate PCI devices, as denoted by the 'slot' ID. I think it should be possible to allow setting functions, since libvirt allows it.

** Affects: nova
   Importance: Undecided
       Status: Opinion

** Changed in: nova
       Status: New => Opinion

https://bugs.launchpad.net/bugs/1940012
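For reference, libvirt's guest XML does allow placing two passed-through functions on a single guest slot. A hedged sketch of what that could look like (all addresses are illustrative, and nova does not generate this today; this is what the report is asking to be made possible):

```xml
<!-- Function 0 of the guest slot; multifunction must be enabled here. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x3b' slot='0x00' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0' multifunction='on'/>
</hostdev>
<!-- Function 1 mapped onto the same guest slot. -->
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x3b' slot='0x00' function='0x1'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x1'/>
</hostdev>
```

With this layout the guest sees one device at slot 05 with two functions, rather than two single-function devices on different slots.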
[Yahoo-eng-team] [Bug 1934203] [NEW] cannot multi attach enabled volume after swap volume
Public bug reported:

Description
===========
Detaching a multi-attach enabled volume fails after swapping the volume.

Steps to reproduce
==================
1. Create two volume types with multi-attach enabled (A, B).
2. Create a new volume using type A.
3. Attach it to a server.
4. Retype this volume to type B.
5. Wait for the retype to succeed; detaching the volume will then fail.

Expected result
===============
The volume should be successfully detached.

Actual result
=============
The detach fails because nova-compute uses a non-existent volume id taken from the connection info.

Environment
===========
1. openstack nova victoria
2. ubuntu 18.04 with docker image using 20.04

** Affects: nova
   Importance: Undecided
       Status: New

https://bugs.launchpad.net/bugs/1934203
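A sketch of the failure mode described above (names are illustrative, not nova's data model): after the retype/swap, the cached connection_info can still carry the old volume's id, while the block-device mapping has been updated to the new one; preferring the BDM id avoids detaching a volume that no longer exists.

```python
def volume_id_for_detach(bdm_volume_id, connection_info):
    """Prefer the authoritative BDM volume id over a possibly stale
    'serial' left in connection_info by the swapped-out volume."""
    serial = (connection_info or {}).get("serial")
    if serial and serial != bdm_volume_id:
        # Stale serial from before the swap; ignore it.
        return bdm_volume_id
    return serial or bdm_volume_id
```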
[Yahoo-eng-team] [Bug 1931209] [NEW] Circular reference detected during cold migration
Public bug reported:

Description
===========
Cold migration fails when a server is specified with a NUMA topology.

Steps to reproduce
==================
Create a server from a flavor specified with NUMA topology parameters, then do a cold migrate or resize.

Expected result
===============
Success.

Actual result
=============
Fails with the following messages:

```console
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server [req-c299c521-2a07-483b-b19e-deb136572da0 dde6a5842265470a8e2f40938ae66097 f3d6994dfaf043479c9cf5bbac19ab87 - default default] Exception during message handling: ValueError: Circular reference detected
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 229, in inner
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     return func(*args, **kwargs)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/manager.py", line 94, in wrapper
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     return fn(self, context, *args, **kwargs)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/nova/compute/utils.py", line 1164, in decorated_function
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/manager.py", line 298, in migrate_server
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     host_list)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/manager.py", line 358, in _cold_migrate
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     updates, ex, request_spec)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     self.force_reraise()
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/six.py", line 693, in reraise
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     raise value
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/manager.py", line 327, in _cold_migrate
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     task.execute()
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/tasks/base.py", line 27, in wrap
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     self.rollback()
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     self.force_reraise()
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     six.reraise(self.type_, self.value, self.tb)
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/six.py", line 693, in reraise
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server     raise value
2021-06-08 04:24:52.963 19 ERROR oslo_messaging.rpc.server   File "/var/lib/openstack/lib/python3.6/site-packages/nova/conductor/tasks/base.py", line
```
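The "Circular reference detected" in the traceback is the standard error Python's json serializer raises when an object graph points back at itself, e.g. when a serialized request spec ends up containing a reference to its own parent. A minimal reproduction of the error class (the payload below is purely illustrative, not nova's actual structure):

```python
import json

# Build a dict whose child references the root, forming a cycle.
payload = {"flavor": {"name": "numa.large"}}
payload["flavor"]["extra_specs"] = payload  # cycle: child points back at root

try:
    json.dumps(payload)
except ValueError as exc:
    # json.dumps checks for cycles by default (check_circular=True).
    assert "Circular reference detected" in str(exc)
```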
[Yahoo-eng-team] [Bug 1929480] [NEW] cloud-init for ubuntu 18.04
Public bug reported:

Ubuntu 18.04 uses netplan to manage networks. Netplan can use either NetworkManager or systemd-networkd internally, but it does not use the legacy networking service. cloud-init.service explicitly depends on networking.service completing, which can be problematic because the network service might not actually be ready.

** Affects: cloud-init
   Importance: Undecided
       Status: New

https://bugs.launchpad.net/bugs/1929480
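One hedged workaround sketch (a local systemd drop-in, not an upstream fix): when netplan renders to systemd-networkd, order cloud-init after networkd's wait-online unit. The drop-in path and unit names below are standard systemd/netplan ones, but whether this ordering is appropriate for a given cloud-init release is an assumption:

```ini
# /etc/systemd/system/cloud-init.service.d/network-ready.conf (hypothetical drop-in)
[Unit]
After=systemd-networkd-wait-online.service
Wants=systemd-networkd-wait-online.service
```

After creating the drop-in, `systemctl daemon-reload` picks it up on the next boot.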
[Yahoo-eng-team] [Bug 1927747] [NEW] neutron ovs agent apply openvswitch security group slow
Public bug reported:

I am using neutron-ovs-agent with the openvswitch firewall; there are around 40 ports with the same security group on the same compute node. It seems that updating the security group for each port takes nearly 3 seconds, which sums up to around 100 seconds in total. This significantly affects the speed of spawning virtual machines.

** Affects: neutron
   Importance: Undecided
       Status: New

** Description changed:

  I am using neutron-ovs-agent using openvswitch firewall, there are
  around 40 ports with same security group on the same compute node. it
  seems update security group for each port will consume near 3 seconds
  which sums up to around 100 seconds in total. This significantly affects
- the speed of creating new ports.
+ the speed of spawning virtual machines.

https://bugs.launchpad.net/bugs/1927747
[Yahoo-eng-team] [Bug 1926049] [NEW] check_changed_vlans failed
Public bug reported:

```console
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-413ff802-0c14-47ad-8221-14d7e972bad3 - - - - -] Error while processing VIF ports: TypeError: %d format: a number is required, not list
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent Traceback (most recent call last):
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/var/lib/openstack/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2658, in rpc_loop
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     ports_not_ready_yet) = (self.process_port_info(
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/var/lib/openstack/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 2453, in process_port_info
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     port_info = self.scan_ports(reg_ports, sync,
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/var/lib/openstack/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1764, in scan_ports
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     updated_ports.update(self.check_changed_vlans())
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/var/lib/openstack/lib/python3.8/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py", line 1795, in check_changed_vlans
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     LOG.info(
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python3.8/logging/__init__.py", line 1794, in info
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.log(INFO, msg, *args, **kwargs)
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python3.8/logging/__init__.py", line 1832, in log
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.logger.log(level, msg, *args, **kwargs)
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python3.8/logging/__init__.py", line 1500, in log
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self._log(level, msg, args, **kwargs)
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python3.8/logging/__init__.py", line 1577, in _log
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.handle(record)
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python3.8/logging/__init__.py", line 1587, in handle
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.callHandlers(record)
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python3.8/logging/__init__.py", line 1649, in callHandlers
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     hdlr.handle(record)
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python3.8/logging/__init__.py", line 950, in handle
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     self.emit(record)
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/var/lib/openstack/lib/python3.8/site-packages/fluent/handler.py", line 237, in emit
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     data = self.format(record)
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/usr/lib/python3.8/logging/__init__.py", line 925, in format
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     return fmt.format(record)
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent   File "/var/lib/openstack/lib/python3.8/site-packages/oslo_log/formatters.py", line 315, in format
2021-04-25 03:19:37.303 1 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent     message = {'message':
```
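The TypeError is raised only when the handler formats the record, because logging defers %-interpolation to emit time; the failing `LOG.info` call in `check_changed_vlans` is passing a list where its format string expects a number. A minimal reproduction and the general shape of a fix (the actual neutron format string is not visible in the trace, so this is illustrative):

```python
import logging

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger("demo")

ports = ["port-a", "port-b"]

# Broken: "%d" cannot render a list. The error does not surface at the
# call site -- it surfaces when a handler formats the record, exactly as
# in the traceback above (inside fluent/oslo_log's format()).
record = logging.LogRecord("demo", logging.INFO, __file__, 1,
                           "%d ports changed VLAN", (ports,), None)
try:
    record.getMessage()
except TypeError as exc:
    print("formatting fails:", exc)

# Fixed: pass a number to %d (or log the list itself with %s).
LOG.info("%d ports changed VLAN: %s", len(ports), ports)
```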
[Yahoo-eng-team] [Bug 1925144] [NEW] timeout in rados connect does not take effect
Public bug reported: Description === From https://github.com/ceph/ceph/blob/0be78da368f2dc1c891e3caafac38f7aa96d3c49/src/pybind/rados/rados.pyx#L660, it looks like the connect function of the Rados object ignores its timeout input, so the currently configured timeout does not take effect. Steps to reproduce == Just configure rbd to use a non-existing IP address; rados.connect will hang there. Expected result === The timeout should take effect. Actual result = When there is a network problem, the call hangs for longer than the configured timeout. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1925144 Title: timeout in rados connect does not take effect Status in OpenStack Compute (nova): New Bug description: Description === From https://github.com/ceph/ceph/blob/0be78da368f2dc1c891e3caafac38f7aa96d3c49/src/pybind/rados/rados.pyx#L660, it looks like the connect function of the Rados object ignores its timeout input, so the currently configured timeout does not take effect. Steps to reproduce == Just configure rbd to use a non-existing IP address; rados.connect will hang there. Expected result === The timeout should take effect. Actual result = When there is a network problem, the call hangs for longer than the configured timeout. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1925144/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
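Since `Rados.connect()` appears to ignore its timeout argument, one workaround is to bound the wait from the caller's side. A sketch, assuming only that the call can hang indefinitely; `call_with_timeout` is an illustrative helper, not part of nova:

```python
import threading

def call_with_timeout(fn, timeout, *args, **kwargs):
    """Run fn in a worker thread; raise TimeoutError if it does not
    finish within `timeout` seconds. Note the worker keeps running in
    the background (daemon thread), so this only bounds the caller's
    wait -- it cannot abort the underlying librados call."""
    result = {}

    def worker():
        try:
            result['value'] = fn(*args, **kwargs)
        except Exception as exc:
            result['error'] = exc

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        raise TimeoutError('call did not complete within %s seconds' % timeout)
    if 'error' in result:
        raise result['error']
    return result['value']

# Hypothetical usage against a hanging connect:
#   cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
#   call_with_timeout(cluster.connect, 5)
```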
[Yahoo-eng-team] [Bug 1925143] [NEW] timeout in rados connect does not take effect
Public bug reported: Description === From https://github.com/ceph/ceph/blob/0be78da368f2dc1c891e3caafac38f7aa96d3c49/src/pybind/rados/rados.pyx#L660, it looks like the connect function of the Rados object ignores its timeout input, so the currently configured timeout does not take effect. Steps to reproduce == Just configure rbd to use a non-existing IP address; rados.connect will hang there. Expected result === The timeout should take effect. Actual result = When there is a network problem, the call hangs for longer than the configured timeout. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1925143 Title: timeout in rados connect does not take effect Status in OpenStack Compute (nova): New Bug description: Description === From https://github.com/ceph/ceph/blob/0be78da368f2dc1c891e3caafac38f7aa96d3c49/src/pybind/rados/rados.pyx#L660, it looks like the connect function of the Rados object ignores its timeout input, so the currently configured timeout does not take effect. Steps to reproduce == Just configure rbd to use a non-existing IP address; rados.connect will hang there. Expected result === The timeout should take effect. Actual result = When there is a network problem, the call hangs for longer than the configured timeout. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1925143/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1923560] [NEW] retrieving security group is slow for server detail
Public bug reported: Description === Querying a large number of VMs through server detail is slow, and a lot of time is wasted on calling the neutron API to obtain security group info. Expected result === Obtaining security group info should not consume half of the total query time. Actual result = Too slow... Environment === 1. ubuntu 18.04 + nova 22 2. libvirt + qemu + kvm 3. ceph 4. vxlan + vlan ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1923560 Title: retrieving security group is slow for server detail Status in OpenStack Compute (nova): New Bug description: Description === Querying a large number of VMs through server detail is slow, and a lot of time is wasted on calling the neutron API to obtain security group info. Expected result === Obtaining security group info should not consume half of the total query time. Actual result = Too slow... Environment === 1. ubuntu 18.04 + nova 22 2. libvirt + qemu + kvm 3. ceph 4. vxlan + vlan To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1923560/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
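One way to cut down the neutron round trips this report describes is to fetch ports and security groups in bulk and join them locally, rather than asking neutron per server. A sketch with stand-in callables (`neutron_list_ports` and `neutron_list_security_groups` are placeholders, not the real neutronclient API):

```python
def security_groups_by_server(neutron_list_ports, neutron_list_security_groups,
                              device_ids):
    """One bulk query for the ports of all servers and one for all
    security groups, then a local join keyed by security group id --
    instead of one neutron round trip per server."""
    ports = neutron_list_ports(device_ids=device_ids)
    sgs = {sg['id']: sg for sg in neutron_list_security_groups()}
    result = {}
    for port in ports:
        result.setdefault(port['device_id'], []).extend(
            sgs[sg_id]['name'] for sg_id in port.get('security_groups', []))
    return result
```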
[Yahoo-eng-team] [Bug 1922222] [NEW] allow using tap device on netdev enabled host
Public bug reported: Hello, after reading the code, it seems nova-compute can only use vhostuser mode if netdev is enabled on the ovs bridge. An internal use case requires us to allow using a tap device as well as a vhostuser device on the same host. Does this sound like a valid use case? ** Affects: neutron Importance: Undecided Status: Opinion ** Changed in: neutron Status: New => Opinion -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/192 Title: allow using tap device on netdev enabled host Status in neutron: Opinion Bug description: Hello, after reading the code, it seems nova-compute can only use vhostuser mode if netdev is enabled on the ovs bridge. An internal use case requires us to allow using a tap device as well as a vhostuser device on the same host. Does this sound like a valid use case? To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/192/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1921804] [NEW] leftover bdm when rabbitmq unstable
Public bug reported: Description === When RabbitMQ is unstable, there is a chance that the method https://github.com/openstack/nova/blob/7a1222a8654684262a8e589d91e67f2b9a9da336/nova/compute/api.py#L4741 times out even though the bdm was successfully created. In such cases, the volume is shown in server show but cannot be detached, and the volume status is available. Steps to reproduce == There may be no way to safely reproduce this failure, because when RabbitMQ is unstable many other services also show unusual behavior. Expected result === We should be able to remove such an attachment from the api without manually fixing the db...

```console
root@mgt02:~# openstack server show 4e5c3c7d-6b4c-4841-9e6e-9a3374036a3e
+-+---+
| Field | Value |
+-+---+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | cn-north-3a |
| OS-EXT-SRV-ATTR:host | compute01 |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute01 |
| OS-EXT-SRV-ATTR:instance_name | instance-ce4c |
| OS-EXT-STS:power_state | Running |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | active |
| OS-SRV-USG:launched_at | 2021-03-29T09:06:38.00 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | newsql-net=192.168.1.217; service_mgt=100.114.3.41 |
| config_drive | True |
| created | 2021-03-29T09:05:19Z |
| flavor | newsql_2C8G40G_general (51db3192-cece-4b9a-9969-7916b4543beb) |
| hostId | cf1f3937a3286677b3020d817541ac33d7c8f1ca74be49b26f128093 |
| id | 4e5c3c7d-6b4c-4841-9e6e-9a3374036a3e |
| image | newsql-bini2.0.0alpha-ubuntu18.04-x64-20210112-pub (4531e3bf-0433-40c6-816b-6763f9d02c7a) |
| key_name | None |
| name | NewSQL-1abc5b28-b9e6-45cd-893d-5bb3a7732a43-3 |
| progress | 0 |
| project_id | acfcc87fc1db430880f0cb1cce410906 |
| properties | productTag='NewSQL' |
| security_groups | name='default' |
| | name='csf-NewSQL-cluster-security-group' |
| status | ACTIVE |
| updated | 2021-03-29T09:06:39Z |
| user_id | a38ef24677cc4a45a143a31c5fb59ee9
```
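A reconciliation check along these lines could flag the leftover BDMs described above (volume visible in `server show` but `available` in cinder). The dict shapes here are illustrative, not nova's real object model:

```python
def orphaned_bdms(nova_bdms, attached_volume_ids):
    """Given nova's block_device_mapping rows for a server and the set
    of volume IDs that cinder actually reports as attached, return the
    BDM entries pointing at volumes cinder considers available -- the
    leftovers this bug produces when the RPC call times out after the
    BDM row was already written."""
    return [bdm for bdm in nova_bdms
            if bdm.get('volume_id') and bdm['volume_id'] not in attached_volume_ids]
```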
[Yahoo-eng-team] [Bug 1914522] [NEW] migrate from iptables firewall to ovs firewall
Public bug reported: Sorry, this is not actually a bug report, but a discussion asking for better clarification in the documentation. Currently we are running the iptables firewall in production and have seen performance degradation, so we plan to upgrade to the ovs firewall in place. Reading the docs, I found the upgrade process described here: https://docs.openstack.org/neutron/latest/contributor/internals/openvswitch_firewall.html#upgrade-path-from-iptables-hybrid-driver. It provides three methods for upgrading an existing cluster. I am interested in method 2, which says "plug the tap device into the integration bridge"; since it does not provide the commands, I would like to ask how to actually perform it. I tried

```console
# brctl delif qbrxxx tapxxx
# ovs-vsctl add-port br-int tapxxx
```

but it does not work: the network appears to be disconnected. Another question: is there an option 4, such that the ovs firewall could take control of existing iptables-firewalled ports, and users could later transition to the ovs firewall gradually? Thank you. ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1914522 Title: migrate from iptables firewall to ovs firewall Status in neutron: New Bug description: Sorry, this is not actually a bug report, but a discussion asking for better clarification in the documentation. Currently we are running the iptables firewall in production and have seen performance degradation, so we plan to upgrade to the ovs firewall in place. Reading the docs, I found the upgrade process described here: https://docs.openstack.org/neutron/latest/contributor/internals/openvswitch_firewall.html#upgrade-path-from-iptables-hybrid-driver. It provides three methods for upgrading an existing cluster. I am interested in method 2, which says "plug the tap device into the integration bridge"; since it does not provide the commands, I would like to ask how to actually perform it. I tried

```console
# brctl delif qbrxxx tapxxx
# ovs-vsctl add-port br-int tapxxx
```

but it does not work: the network appears to be disconnected. Another question: is there an option 4, such that the ovs firewall could take control of existing iptables-firewalled ports, and users could later transition to the ovs firewall gradually? Thank you. To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1914522/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
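Regarding plugging the tap device directly into br-int (bug 1914522): a plain `ovs-vsctl add-port` is usually not enough, because the ovs agent only wires up flows for interfaces carrying Neutron's external-ids. A hedged sketch of a fuller sequence; `tapXXX`, `qbrXXX`, `PORT_ID`, and `MAC` are placeholders for the tap device, the linux bridge, the Neutron port UUID, and the port MAC, and this is untested guidance rather than the documented procedure:

```console
# brctl delif qbrXXX tapXXX
# ovs-vsctl add-port br-int tapXXX \
    -- set Interface tapXXX external-ids:iface-id=PORT_ID \
    -- set Interface tapXXX external-ids:iface-status=active \
    -- set Interface tapXXX external-ids:attached-mac=MAC
```

With the `iface-id` set, the agent should detect the port as added and install the appropriate OpenFlow rules; without it, traffic stays blackholed, which may explain the disconnect observed above.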
[Yahoo-eng-team] [Bug 1910946] [NEW] ovs is dead but ovs agent is up
Public bug reported: We are using openstack-neutron Rocky with Open vSwitch 2.10.0 on Ubuntu 18.04, which shipped with a libc6 bug, reported here: https://github.com/openvswitch/ovs-issues/issues/175. My concern is that when this bug happens, the ovs agent stops working and is reported dead in its log, but it still reports heartbeats to neutron-server. This is problematic because users looking at the agent service state will be unaware that ovs-agent is no longer working. ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1910946 Title: ovs is dead but ovs agent is up Status in neutron: New Bug description: We are using openstack-neutron Rocky with Open vSwitch 2.10.0 on Ubuntu 18.04, which shipped with a libc6 bug, reported here: https://github.com/openvswitch/ovs-issues/issues/175. My concern is that when this bug happens, the ovs agent stops working and is reported dead in its log, but it still reports heartbeats to neutron-server. This is problematic because users looking at the agent service state will be unaware that ovs-agent is no longer working. To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1910946/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
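Until the agent itself detects this condition, an external health check could probe OVS liveness directly instead of trusting the heartbeat. A sketch; the suggested probe command `['ovs-vsctl', '--timeout=5', 'show']` is an assumption about what best detects a wedged vswitchd/ovsdb:

```python
import subprocess

def command_alive(cmd, timeout=5):
    """Return True if `cmd` exits 0 within `timeout` seconds.
    A monitoring agent could run this with
    ['ovs-vsctl', '--timeout=5', 'show'] and mark the node unhealthy
    when OVS is wedged, even while the neutron agent heartbeat still
    looks fine."""
    try:
        return subprocess.run(cmd, timeout=timeout,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False
```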
[Yahoo-eng-team] [Bug 1910947] [NEW] ovs is dead but ovs agent is up
Public bug reported: We are using openstack-neutron Rocky with Open vSwitch 2.10.0 on Ubuntu 18.04, which shipped with a libc6 bug, reported here: https://github.com/openvswitch/ovs-issues/issues/175. My concern is that when this bug happens, the ovs agent stops working and is reported dead in its log, but it still reports heartbeats to neutron-server. This is problematic because users looking at the agent service state will be unaware that ovs-agent is no longer working. ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1910947 Title: ovs is dead but ovs agent is up Status in neutron: New Bug description: We are using openstack-neutron Rocky with Open vSwitch 2.10.0 on Ubuntu 18.04, which shipped with a libc6 bug, reported here: https://github.com/openvswitch/ovs-issues/issues/175. My concern is that when this bug happens, the ovs agent stops working and is reported dead in its log, but it still reports heartbeats to neutron-server. This is problematic because users looking at the agent service state will be unaware that ovs-agent is no longer working. To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1910947/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1909160] Re: high cpu usage when listing security groups
Ok, I'll try out Victoria and compare the result. Thank you for the reply. ** Changed in: neutron Status: New => Opinion -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1909160 Title: high cpu usage when listing security groups Status in neutron: Opinion Bug description: I saw that listing security groups is slow and causes unexpected CPU spikes. I run a Rocky neutron-server with the api worker count set to 1. When executing commands like

```console
root@mgt01:~# time curl -H "x-auth-token: $token" http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups -owide -s
real    0m2.328s
user    0m0.016s
sys     0m0.012s
root@mgt01:~# curl -H "x-auth-token: $token" http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups | jq '.security_groups | length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  497k  100  497k    0     0   219k      0  0:00:02  0:00:02 --:--:--  219k
225
```

it returns in around 2 seconds. There are around 200 security groups, so maybe it is not extremely slow, but what is interesting is that calling this REST API seems to cause CPU spikes for the neutron-server pod:

```console
CONTAINER ID   NAME                                                                                                  CPU %    MEM USAGE / LIMIT    MEM %    NET I/O   BLOCK I/O    PIDS
8a30733e3932   k8s_neutron-server_neutron-server-787dcd7964-2zxt5_openstack_71cbb9bc-4530-11eb-bcc6-525400d22fc9_0   92.83%   1020MiB / 2.441GiB   40.81%   0B / 0B   0B / 16.4kB  8
```

I am wondering why security group listing is CPU bound? To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1909160/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1909160] [NEW] high cpu usage when listing security groups
Public bug reported: I saw that listing security groups is slow and causes unexpected CPU spikes. I run a Rocky neutron-server with the api worker count set to 1. When executing commands like

```console
root@mgt01:~# time curl -H "x-auth-token: $token" http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups -owide -s
real    0m2.328s
user    0m0.016s
sys     0m0.012s
root@mgt01:~# curl -H "x-auth-token: $token" http://neutron-server.openstack.svc.region-stackdev.myinspurcloud.com:9696/v2.0/security_groups | jq '.security_groups | length'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  497k  100  497k    0     0   219k      0  0:00:02  0:00:02 --:--:--  219k
225
```

it returns in around 2 seconds. There are around 200 security groups, so maybe it is not extremely slow, but what is interesting is that calling this REST API seems to cause CPU spikes for the neutron-server pod:

```console
CONTAINER ID   NAME                                                                                                  CPU %    MEM USAGE / LIMIT    MEM %    NET I/O   BLOCK I/O    PIDS
8a30733e3932   k8s_neutron-server_neutron-server-787dcd7964-2zxt5_openstack_71cbb9bc-4530-11eb-bcc6-525400d22fc9_0   92.83%   1020MiB / 2.441GiB   40.81%   0B / 0B   0B / 16.4kB  8
```

I am wondering why security group listing is CPU bound? ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1909160 Title: high cpu usage when listing security groups Status in neutron: New Bug description: I saw that listing security groups is slow and causes unexpected CPU spikes. I run a Rocky neutron-server with the api worker count set to 1. Calling the security group listing API returns in around 2 seconds for around 200 security groups, so maybe it is not extremely slow, but what is interesting is that calling this REST API causes CPU spikes (over 90%) for the neutron-server pod, as shown in the console output above. I am wondering why security group listing is CPU bound? To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1909160/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
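To see where the CPU actually goes in a case like this, one option is to profile a single request handler inside the API worker. A generic sketch using only the standard-library profiler; wiring it into neutron-server (e.g. around the security-group list handler) is left as an assumption:

```python
import cProfile
import io
import pstats

def profile_call(fn, *args, **kwargs):
    """Profile one call and return (result, report). The report lists
    the top-10 entries by cumulative time, which is usually enough to
    tell serialization, policy checks, and database time apart."""
    prof = cProfile.Profile()
    result = prof.runcall(fn, *args, **kwargs)
    out = io.StringIO()
    pstats.Stats(prof, stream=out).sort_stats('cumulative').print_stats(10)
    return result, out.getvalue()
```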
[Yahoo-eng-team] [Bug 1908957] [NEW] iptable rules collision deployed with k8s iptables kube-proxy enabled
Public bug reported: Maybe it's a k8s kube-proxy related bug, but maybe it is easier to solve on neutron's side... In k8s, either NodePort or ExternalIP will generate iptables rules which will affect VM traffic when the hybrid iptables plugin is enabled. The problem is:

Chain PREROUTING (policy ACCEPT 650 packets, 65873 bytes)
 pkts bytes target        prot opt in out source    destination
 560K   37M ACCEPT        all  --  *  *   0.0.0.0/0 0.0.0.0/0   PHYSDEV match --physdev-is-in
  56M 4944M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */
  40M 3785M KUBE-SERVICES all  --  *  *   0.0.0.0/0 0.0.0.0/0   /* kubernetes service portals */

Packets will be DNATed to something we do not want, and such traffic will be dropped in the end. By adding the following rule, it seems the problem is mitigated:

iptables -t nat -I PREROUTING 2 -m physdev --physdev-is-in -j ACCEPT

** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1908957 Title: iptable rules collision deployed with k8s iptables kube-proxy enabled Status in neutron: New Bug description: Maybe it's a k8s kube-proxy related bug, but maybe it is easier to solve on neutron's side... In k8s, either NodePort or ExternalIP will generate iptables rules which will affect VM traffic when the hybrid iptables plugin is enabled. The problem is the nat PREROUTING chain shown above: repeated KUBE-SERVICES rules match VM traffic before neutron's rules do. Packets will be DNATed to something we do not want, and such traffic will be dropped in the end. By adding the following rule, it seems the problem is mitigated:

iptables -t nat -I PREROUTING 2 -m physdev --physdev-is-in -j ACCEPT

To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1908957/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More
[Yahoo-eng-team] [Bug 1902806] [NEW] only 7 iscsi disk could be attached
Public bug reported: For libvirt version 4.0.0, a SCSI disk with a unit number equal to 7 cannot be attached, due to libvirt's own limitation. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1902806 Title: only 7 iscsi disk could be attached Status in OpenStack Compute (nova): New Bug description: For libvirt version 4.0.0, a SCSI disk with a unit number equal to 7 cannot be attached, due to libvirt's own limitation. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1902806/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
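Since unit 7 on the controller is unusable, a disk-address allocator has to skip it rather than fail once units 0-6 are taken (which is why only 7 disks fit, per the title). A hedged sketch, not nova's actual allocation code:

```python
def next_scsi_unit(used_units, reserved=frozenset({7}), max_units=16):
    """Pick the lowest free unit number on a SCSI controller, skipping
    reserved slots. With unit 7 rejected by libvirt here, units 0-6
    hold the 7 attachable disks; the eighth disk should land on unit 8
    instead of failing to attach."""
    for unit in range(max_units):
        if unit not in used_units and unit not in reserved:
            return unit
    raise ValueError("no free unit on this controller")
```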
[Yahoo-eng-team] [Bug 1901124] [NEW] memcached cache not get expired
Public bug reported: We are using OpenStack Rocky, and when I check memcached, I find:

root@compute:~# telnet compute 11211
Trying 192.168.0.17...
Connected to compute.
Escape character is '^]'.
stats cachedump 15 1
ITEM c9067b617ec1e6e7f78318c19e7ce2c7f4f9dcd6 [2034 b; 0 s]

The expiration time is not set for the keys. Even when I set all of the cache_time options, it still does not change.

[identity]
password_hash_rounds = 4
driver = sql
[assignment]
driver = sql
[catalog]
cache_time = 300
[role]
driver = sql
cache_time = 300
[resource]
driver = sql
cache_time = 300
[application_credential]
cache_time = 300
[oslo.cache]
expiration_time = 300
[cache]
memcache_servers = compute:11211
backend = dogpile.cache.memcached
enabled = true
expiration_time = 300
cache_time = 300
[oslo_messaging_notifications]
transport_url = rabbit://stackrabbit:secret@192.168.0.5:5672/
[DEFAULT]
max_token_size = 16384
debug = True
logging_exception_prefix = ERROR %(name)s %(instance)s
logging_default_format_string = %(color)s%(levelname)s %(name)s [-%(color)s] %(instance)s%(color)s%(message)s
logging_context_format_string = %(color)s%(levelname)s %(name)s [%(global_request_id)s %(request_id)s %(project_name)s %(user_name)s%(color)s] %(instance)s%(color)s%(message)s
logging_debug_format_suffix = {{(pid=%(process)d) %(funcName)s %(pathname)s:%(lineno)d}}
admin_endpoint = http://192.168.0.5/identity
public_endpoint = http://192.168.0.5/identity
[token]
provider = fernet
cache_time = 300
[database]
connection = mysql+pymysql://root:secret@127.0.0.1/keystone?charset=utf8
[fernet_tokens]
key_repository = /etc/keystone/fernet-keys/
[credential]
key_repository = /etc/keystone/credential-keys/
[security_compliance]
unique_last_password_count = 2
lockout_duration = 10
lockout_failure_attempts = 2
[unified_limit]
cache_time = 300

** Affects: keystone Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Identity (keystone). https://bugs.launchpad.net/bugs/1901124 Title: memcached cache not get expired Status in OpenStack Identity (keystone): New Bug description: We are using OpenStack Rocky, and when I check memcached, I find:

root@compute:~# telnet compute 11211
Trying 192.168.0.17...
Connected to compute.
Escape character is '^]'.
stats cachedump 15 1
ITEM c9067b617ec1e6e7f78318c19e7ce2c7f4f9dcd6 [2034 b; 0 s]

The expiration time is not set for the keys. Even when I set all of the cache_time options (in [catalog], [role], [resource], [application_credential], [cache], [token], and [unified_limit], plus expiration_time in [oslo.cache] and [cache], with the full keystone.conf shown above), it still does not change. To manage notifications about this bug go to: https://bugs.launchpad.net/keystone/+bug/1901124/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1897236] [NEW] create port in a shared network failed for user with member role
Public bug reported: Creating a port on a shared network using a user that has the member role on another project fails. ** Affects: horizon Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Dashboard (Horizon). https://bugs.launchpad.net/bugs/1897236 Title: create port in a shared network failed for user with member role Status in OpenStack Dashboard (Horizon): New Bug description: Creating a port on a shared network using a user that has the member role on another project fails. To manage notifications about this bug go to: https://bugs.launchpad.net/horizon/+bug/1897236/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1896574] Re: how to deal with hypervisor name changing
I think the previous title is misleading. Actually the hostname itself is still A; what changes is the FQDN seen by hostname --fqdn. ** Changed in: nova Status: Invalid => New ** Summary changed: - how to deal with hypervisor name changing + how to deal with hypervisor host fqdn name changing ** Description changed: - nova fails to correctly account for resources after hypervisor name - changes. For example, if previously the hypervisor name is A, and some - later it switches to A.B, then all of the instances which belong to A + nova fails to correctly account for resources after the hypervisor hostname + fqdn changes. For example, if previously the hypervisor hostname fqdn is + A, and some time later it changes to A.B, then all of the instances which belong to A will not be included in the resource computation for A.B although effectively they are the same thing. + But under such circumstances, the compute service's host is still A. + Is there any way to deal with this situation? We are using openstack rocky. ** Description changed: nova fails to correctly account for resources after the hypervisor hostname fqdn changes. For example, if previously the hypervisor hostname fqdn is A, and some time later it changes to A.B, then all of the instances which belong to A will not be included in the resource computation for A.B although effectively they are the same thing. - But under such circumstances, the compute service's host is still A. + But under such circumstances, the compute service's host is listed as A. Is there any way to deal with this situation? We are using openstack rocky. -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1896574 Title: how to deal with hypervisor host fqdn name changing Status in OpenStack Compute (nova): New Bug description: nova fails to correctly account for resources after the hypervisor hostname fqdn changes.
For example, if previously the hypervisor hostname fqdn is A, and some time later it changes to A.B, then all of the instances which belong to A will not be included in the resource computation for A.B although effectively they are the same thing. But under such circumstances, the compute service's host is still listed as A. Is there any way to deal with this situation? We are using openstack rocky. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1896574/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
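The accounting failure described above can be illustrated with a toy model (this is not nova's real code, just a sketch of the keying problem): when usage is grouped by the reported hypervisor name, a rename from A to A.B makes the same physical host look empty.

```python
# instance -> hypervisor name recorded when the instance was created
instances = {"vm-1": "A", "vm-2": "A"}

def instances_on(hypervisor_name):
    """Count instances attributed to a hypervisor, keyed by its reported name."""
    return sum(1 for host in instances.values() if host == hypervisor_name)

print(instances_on("A"))    # 2
print(instances_on("A.B"))  # 0 -- after the fqdn change, the same host appears empty
```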
[Yahoo-eng-team] [Bug 1896574] [NEW] how to deal with hypervisor name changing
Public bug reported: nova fails to correctly account for resources after the hypervisor name changes. For example, if previously the hypervisor name is A, and some time later it switches to A.B, then all of the instances which belong to A will not be included in the resource computation for A.B although effectively they are the same thing. Is there any way to deal with this situation? We are using openstack rocky. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1896574 Title: how to deal with hypervisor name changing Status in OpenStack Compute (nova): New Bug description: nova fails to correctly account for resources after the hypervisor name changes. For example, if previously the hypervisor name is A, and some time later it switches to A.B, then all of the instances which belong to A will not be included in the resource computation for A.B although effectively they are the same thing. Is there any way to deal with this situation? We are using openstack rocky. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1896574/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1542032] Re: IP reassembly issue on the Linux bridges in Openstack
** Changed in: neutron Status: Confirmed => Invalid -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1542032 Title: IP reassembly issue on the Linux bridges in Openstack Status in neutron: Invalid Bug description: Hi, Sorry for text diagram. It does not look very well on this screen. Please, copy paste in a decent fixed width text editor. Thanks, Claude. Title: IP reassembly issue on the Linux bridges in Openstack Summary: When the security groups and the Neutron firewall are active in Openstack, each and every VM virtual network interface (VNIC) is isolated in a Linux bridge and IP reassembly must be performed in order to allow firewall inspection of the traffic. The reassembled traffic sometimes exceeds the capacity of the physical interfaces and the traffic is not forwarded properly. Linux bridge diagram:

VM [TAP] -- [QBR bridge] -- [QVB] <-> [QVO] -- [OVS] -- [P] -- [FW-ADMIN bridge] -- [PHY]

Introduction: - In Openstack, the virtual machine (VM) uses the OpenvSwitch (OVS) for networking purposes. This is not a mandatory setup but this is a common setup in Openstack. When the Neutron firewall and the security groups are active, each VM VNIC, also called a tap interface, is connected to a Linux bridge. This is the QBR bridge. The QVB interface enables the network communication with OVS. The QVB interface interacts with the QVO interface in OVS. Security analysis is performed on the Linux bridge. In order to perform adequate traffic inspection, the fragmented traffic has to be re-assembled. The traffic is then forwarded according to the Maximum Transmission Unit (MTU) of the interfaces in the bridge. The MTU values on all the interfaces are set to 65000 bytes. This is where a part of the problem experienced with NFV applications is observed.
Analysis: - As a real life example, the NFV application uses NFS between VMs. NFS is a well known feature in Unix environments. This feature provides network file systems. This is the equivalent of a network drive in the Windows world. NFS is known to produce large frames. In this example, VM1 (169.254.4.242) sends a large NFS write instruction to VM2. The example below shows a 5 KB packet. The traffic is fragmented into several packets as instructed by the VM1 VNIC. This is the desired behavior.

root@node-11:~# tcpdump -e -n -i tap3e79842d-eb host 169.254.1.13
23:46:48.938255 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4 (0x0800), length 1514: 169.254.4.242.3015988240 > 169.254.1.13.2049: 1472 write fh Unknown/01000601B1198A1CB3CC4E1EA3AB0B26017B0AD653620700D59B28C7 4863 (4863) bytes @ 229376
23:46:48.938271 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4 (0x0800), length 1514: 169.254.4.242 > 169.254.1.13: ip-proto-17
23:46:48.938279 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4 (0x0800), length 1514: 169.254.4.242 > 169.254.1.13: ip-proto-17
23:46:48.938287 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4 (0x0800), length 590: 169.254.4.242 > 169.254.1.13: ip-proto-17

The same packet is found on the QVB interface in one large frame.

root@node-11:~# tcpdump -e -n -i qvb3e79842d-eb host 169.254.1.13
23:46:48.938322 00:80:37:0e:0f:12 > 00:80:37:0e:0b:12, ethertype IPv4 (0x0800), length 5030: 169.254.4.242.3015988240 > 169.254.1.13.2049: 4988 write fh Unknown/01000601B1198A1CB3CC4E1EA3AB0B26017B0AD653620700D59B28C7 4863 (4863) bytes @ 229376

Such large packets cannot cross physical interfaces without being fragmented again if jumbo frames support is not active in the network. Even with jumbo frames, the NFS frame size can easily cross the 9K barrier. NFS frame sizes up to 32 KB can be observed with NFS over UDP. For some reason, this traffic does not seem to be transmitted properly between compute hosts in Openstack.
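The frame sizes in the first tcpdump above are exactly what IPv4 fragmentation of the 4988-byte NFS write predicts. A quick sketch, assuming standard header sizes (14-byte Ethernet, 20-byte IP, 8-byte UDP) and the rule that fragment data must be a multiple of 8 bytes:

```python
def fragment_frame_sizes(l4_payload, mtu=1500, eth_hdr=14, ip_hdr=20, udp_hdr=8):
    """Ethernet frame sizes produced when a UDP datagram is fragmented to fit mtu."""
    remaining = udp_hdr + l4_payload        # the UDP header travels in the first fragment
    per_fragment = (mtu - ip_hdr) // 8 * 8  # fragment offsets work in 8-byte units
    frames = []
    while remaining > 0:
        data = min(per_fragment, remaining)
        frames.append(eth_hdr + ip_hdr + data)
        remaining -= data
    return frames

# The 4988-byte NFS write from the capture: three full frames plus a 590-byte tail.
print(fragment_frame_sizes(4988))  # [1514, 1514, 1514, 590]
```

The result matches the captured frame lengths (three 1514-byte frames and one 590-byte frame), confirming the tap interface is fragmenting normally.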
Further investigations have revealed the large frames are leaving the OVS internal bridge (br-int) in the direction of the private bridge (br-prv) using a patch interface in OVS. Once the traffic has reached this point, it uses the "P" interface (i.e.: p_51a2-0) to reach another Linux bridge (br-fw-admin) where the physical interface is connected. The "P"
[Yahoo-eng-team] [Bug 1895063] [NEW] Allow rescue volume backed instance
Public bug reported: Should we offer rescue support for volume-backed instances? ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1895063 Title: Allow rescue volume backed instance Status in OpenStack Compute (nova): New Bug description: Should we offer rescue support for volume-backed instances? To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1895063/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1893015] [NEW] ping with large packet size fails
Public bug reported: We are using neutron rocky, with the security group driver set to iptables_hybrid; the cluster is deployed on top of a kubernetes cluster, and all the networks are set to mtu 1500. The problem I am facing right now is that ping across compute nodes fails with a packet size larger than the mtu: ping -s 2000 172.20.93.171 Surprisingly, if I ping an IP address from the same node, it works without any issue. I have done a simple tcpdump on qvb (both on the remote and local compute node): tcpdump -i qvb host 172.20.93.171 and icmp And I saw the traffic, but if I listen on tap or qbr, no traffic is captured. I tried to add a LOG iptables rule to debug: iptables -t raw -I PREROUTING 1 -m physdev --physdev-in qvb373214e3-8d -p icmp -s 172.20.93.173/12 -j LOG --log-prefix='[netfilter] ' Weirdly enough, no packets are counted when the packet size is set to 2000. ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1893015 Title: ping with large packet size fails Status in neutron: New Bug description: We are using neutron rocky, with the security group driver set to iptables_hybrid; the cluster is deployed on top of a kubernetes cluster, and all the networks are set to mtu 1500. The problem I am facing right now is that ping across compute nodes fails with a packet size larger than the mtu: ping -s 2000 172.20.93.171 Surprisingly, if I ping an IP address from the same node, it works without any issue. I have done a simple tcpdump on qvb (both on the remote and local compute node): tcpdump -i qvb host 172.20.93.171 and icmp And I saw the traffic, but if I listen on tap or qbr, no traffic is captured.
I tried to add a LOG iptables rule to debug: iptables -t raw -I PREROUTING 1 -m physdev --physdev-in qvb373214e3-8d -p icmp -s 172.20.93.173/12 -j LOG --log-prefix='[netfilter] ' Weirdly enough, no packets are counted when the packet size is set to 2000. To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1893015/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
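For reference, the largest ICMP payload that fits in a 1500-byte MTU without fragmentation is 1472 bytes (1500 minus a 20-byte IP header and an 8-byte ICMP header), so `ping -s 2000` as used in this report is always fragmented. A one-line check (a sketch, not neutron code):

```python
def ping_is_fragmented(icmp_data, mtu=1500, ip_hdr=20, icmp_hdr=8):
    """True if an ICMP echo with icmp_data payload bytes exceeds the MTU."""
    return icmp_data + icmp_hdr + ip_hdr > mtu

print(ping_is_fragmented(2000))  # True  -- the failing case in this report
print(ping_is_fragmented(1472))  # False -- 1472 + 8 + 20 fills the MTU exactly
```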
[Yahoo-eng-team] [Bug 1892582] [NEW] image creation does not fail immediately if volume not created
Public bug reported: With the cinder backend, image creation fails only after a long wait while the backing volume is still in the creating state: root@mgt01:~# openstack volume list --all | grep fb8aee1b-e19e-4336-8fa2-864f1664b834 | b1e021bd-974d-4974-961b-47ab7f9b0a16 | image-fb8aee1b-e19e-4336-8fa2-864f1664b834 | creating | 500 | ** Affects: glance Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to Glance. https://bugs.launchpad.net/bugs/1892582 Title: image creation does not fail immediately if volume not created Status in Glance: New Bug description: With the cinder backend, image creation fails only after a long wait while the backing volume is still in the creating state: root@mgt01:~# openstack volume list --all | grep fb8aee1b-e19e-4336-8fa2-864f1664b834 | b1e021bd-974d-4974-961b-47ab7f9b0a16 | image-fb8aee1b-e19e-4336-8fa2-864f1664b834 | creating | 500 | To manage notifications about this bug go to: https://bugs.launchpad.net/glance/+bug/1892582/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
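Failing fast here would mean polling the volume with a bounded timeout and aborting on an error state instead of waiting indefinitely. A minimal sketch of that pattern (hypothetical helper, not glance's actual code; `get_status` stands in for a cinder status lookup):

```python
import time

def wait_for_volume(get_status, timeout=5.0, interval=0.01):
    """Poll a volume's status; return on 'available', fail fast on 'error' or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == "available":
            return status
        if status == "error":
            raise RuntimeError("volume entered error state")
        time.sleep(interval)
    raise TimeoutError("volume did not become available in time")

# Simulated status sequence: two polls still 'creating', then 'available'.
statuses = iter(["creating", "creating", "available"])
print(wait_for_volume(lambda: next(statuses)))  # available
```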
[Yahoo-eng-team] [Bug 1887108] [NEW] wrong l2pop flows on vlan network
Public bug reported: I saw l2pop rules for a vlan network which cause problems for MAC learning. There is no dvr router associated with it; it is a pure vlan network.

root@compute02:/tmp# ovs-ofctl dump-flows br-tun table=21
cookie=0xcd381baa7a6d5b5c, duration=1703630.319s, table=21, n_packets=0, n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.36 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163ec89311->NXM_NX_ARP_SHA[],load:0xacc89224->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:c8:93:11,IN_PORT
cookie=0xcd381baa7a6d5b5c, duration=1703175.829s, table=21, n_packets=0, n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.38 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e8b426c->NXM_NX_ARP_SHA[],load:0xacc89226->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:8b:42:6c,IN_PORT
cookie=0xcd381baa7a6d5b5c, duration=1703156.363s, table=21, n_packets=0, n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.37 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e31dd83->NXM_NX_ARP_SHA[],load:0xacc89225->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:31:dd:83,IN_PORT
cookie=0xcd381baa7a6d5b5c, duration=1703137.459s, table=21, n_packets=0, n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.39 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e2e8650->NXM_NX_ARP_SHA[],load:0xacc89227->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:2e:86:50,IN_PORT
cookie=0xcd381baa7a6d5b5c, duration=1703090.494s, table=21, n_packets=0, n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.41
actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e0a4d1c->NXM_NX_ARP_SHA[],load:0xacc89229->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:0a:4d:1c,IN_PORT cookie=0xcd381baa7a6d5b5c, duration=1703068.578s, table=21, n_packets=0, n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.40 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e99553b->NXM_NX_ARP_SHA[],load:0xacc89228->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:99:55:3b,IN_PORT cookie=0xcd381baa7a6d5b5c, duration=1703050.537s, table=21, n_packets=0, n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.45 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163ecc5303->NXM_NX_ARP_SHA[],load:0xacc8922d->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:cc:53:03,IN_PORT cookie=0xcd381baa7a6d5b5c, duration=1703033.613s, table=21, n_packets=0, n_bytes=0, priority=1,arp,dl_vlan=1,arp_tpa=172.200.146.43 actions=load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xfa163e5ffd39->NXM_NX_ARP_SHA[],load:0xacc8922b->NXM_OF_ARP_SPA[],move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],mod_dl_src:fa:16:3e:5f:fd:39,IN_PORT root@mgt01:~# openstack port list --fixed-ip ip-address=172.200.146.36 +--+---+---+---++ | ID | Name | MAC Address | Fixed IP Addresses| Status | +--+---+---+---++ | 48131502-da22-4968-9b0b-f1efc3a860a1 | ecs_eni_0 | fa:16:3e:c8:93:11 | ip_address='172.200.146.36', subnet_id='d0890fec-6f33-4f08-8f7c-67fc429c91b8' | ACTIVE | +--+---+---+---++ root@mgt01:~# openstack network show `openstack port show 48131502-da22-4968-9b0b-f1efc3a860a1 -c network_id -f value` +---+--+ | Field | Value| +---+--+ | admin_state_up| UP | | availability_zone_hints | | 
| availability_zones| az-jiaozuo-zww-1 | | created_at| 2020-06-14T00:09:26Z | | description | | | dns_domain
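The hex values loaded into NXM_OF_ARP_SPA in the flows above are just the target IPs as 32-bit integers, so decoding them shows which port each l2pop ARP-responder rule answers for. A small sketch:

```python
import ipaddress

def decode_arp_spa(value):
    """Decode a NXM_OF_ARP_SPA load value (32-bit int) into dotted-quad form."""
    return str(ipaddress.ip_address(value))

# From the first flow above: load:0xacc89224->NXM_OF_ARP_SPA[]
print(decode_arp_spa(0xacc89224))  # 172.200.146.36 -- the port shown in the listing
```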
[Yahoo-eng-team] [Bug 1886355] [NEW] glance upload image to rbd backend stuck
Public bug reported: Uploading an image to the rbd backend gets stuck in the saving state: the rbd du command shows the image size is not increasing, and ceph osd pool stats shows that there is no client io. A tcpdump shows the program is actually trying to receive from the client with a rather small window size (280 bytes), which is considerably small compared to the actual image size (35 GB). ** Affects: glance Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to Glance. https://bugs.launchpad.net/bugs/1886355 Title: glance upload image to rbd backend stuck Status in Glance: New Bug description: Uploading an image to the rbd backend gets stuck in the saving state: the rbd du command shows the image size is not increasing, and ceph osd pool stats shows that there is no client io. A tcpdump shows the program is actually trying to receive from the client with a rather small window size (280 bytes), which is considerably small compared to the actual image size (35 GB). To manage notifications about this bug go to: https://bugs.launchpad.net/glance/+bug/1886355/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1884695] [NEW] allow less strict cpu flag comparison
Public bug reported: Description === Nova uses a strict cpu flag comparison during live migration; this introduces some problems when migrating with cpu flags which do not actually affect migration. For example, the `monitoring` flag could safely be ignored. So I think it might be reasonable to ignore some features based on user input, whether statically by configuration or dynamically from api input. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1884695 Title: allow less strict cpu flag comparison Status in OpenStack Compute (nova): New Bug description: Description === Nova uses a strict cpu flag comparison during live migration; this introduces some problems when migrating with cpu flags which do not actually affect migration. For example, the `monitoring` flag could safely be ignored. So I think it might be reasonable to ignore some features based on user input, whether statically by configuration or dynamically from api input. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1884695/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
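The relaxed comparison proposed here amounts to subtracting an operator-supplied ignore list before the strict subset check. A sketch of the idea (names are illustrative, not nova's API; the report's `monitoring` flag is used as the ignorable example):

```python
def cpu_compatible(source_flags, dest_flags, ignorable=frozenset({"monitoring"})):
    """Strict check requires every source flag on the destination; the relaxed
    variant first drops flags the operator marked as ignorable."""
    missing = set(source_flags) - set(dest_flags) - ignorable
    return not missing

src = {"sse4_2", "avx", "monitoring"}
dst = {"sse4_2", "avx"}
print(cpu_compatible(src, dst, ignorable=frozenset()))  # False -- strict comparison fails
print(cpu_compatible(src, dst))                         # True  -- 'monitoring' is ignored
```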
[Yahoo-eng-team] [Bug 1884532] [NEW] inconsistent data in ipamallocations
Public bug reported: Sometimes I see that the database is inconsistent, for example, as shown below:

MariaDB [neutron]> select * from ipamsubnets where neutron_subnet_id='9a8fd2b0-743c-4500-8978-9e5bf9b38347';
| id | neutron_subnet_id |
| 85e7171c-2648-4447-ada6-a37c3c113686 | 9a8fd2b0-743c-4500-8978-9e5bf9b38347 |
1 row in set (0.00 sec)

MariaDB [neutron]> select * from ipamallocations where ipam_subnet_id = '85e7171c-2648-4447-ada6-a37c3c113686' \G
*** 1. row ***
ip_address: 10.13.45.1
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 2. row ***
ip_address: 10.13.45.2
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 3. row ***
ip_address: 10.13.45.3
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 4. row ***
ip_address: 10.13.45.4
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 5. row ***
ip_address: 10.13.45.5
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 6.
row ***
ip_address: 10.13.45.6
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
6 rows in set (0.00 sec)

MariaDB [neutron]> select * from ipamallocations where ipam_subnet_id = '85e7171c-2648-4447-ada6-a37c3c113686' \G
MariaDB [neutron]> select * from ipallocations where subnet_id='9a8fd2b0-743c-4500-8978-9e5bf9b38347';
| port_id | ip_address | subnet_id | network_id |
| 0ae2630a-76d9-47b1-bf2f-012c2356df75 | 10.13.45.1 | 9a8fd2b0-743c-4500-8978-9e5bf9b38347 | c3ea28cd-e76a-4e49-b538-cc05c0173b83 |
| 83b4683a-fb57-4844-9e1d-55b111fa0e19 | 10.13.45.2 | 9a8fd2b0-743c-4500-8978-9e5bf9b38347 | c3ea28cd-e76a-4e49-b538-cc05c0173b83 |
| 7f0224dd-c49b-42a8-8c8a-bd3aa6c24223 | 10.13.45.3 | 9a8fd2b0-743c-4500-8978-9e5bf9b38347 | c3ea28cd-e76a-4e49-b538-cc05c0173b83 |
| f53b335e-535b-42ce-be53-0d6cee48cf28 | 10.13.45.4 | 9a8fd2b0-743c-4500-8978-9e5bf9b38347 | c3ea28cd-e76a-4e49-b538-cc05c0173b83 |
4 rows in set (0.00 sec)

Apparently the ipam table is not consistent with the real IP allocations. When this happens, some IP addresses are not allocatable even though openstack port list cannot find them. We are using mariadb for production, and I haven't seen a problem like this using MySQL. ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron.
https://bugs.launchpad.net/bugs/1884532 Title: inconsistent data in ipamallocations Status in neutron: New Bug description: Sometimes I see that the database is inconsistent, for example, as shown below:

MariaDB [neutron]> select * from ipamsubnets where neutron_subnet_id='9a8fd2b0-743c-4500-8978-9e5bf9b38347';
| id | neutron_subnet_id |
| 85e7171c-2648-4447-ada6-a37c3c113686 | 9a8fd2b0-743c-4500-8978-9e5bf9b38347 |
1 row in set (0.00 sec)

MariaDB [neutron]> select * from ipamallocations where ipam_subnet_id = '85e7171c-2648-4447-ada6-a37c3c113686' \G
*** 1. row ***
ip_address: 10.13.45.1
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 2. row ***
ip_address: 10.13.45.2
status: ALLOCATED
ipam_subnet_id: 85e7171c-2648-4447-ada6-a37c3c113686
*** 3. row ***
ip_address: 10.13.45.3
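Given the two dumps in this report, the inconsistency can be located by diffing the IP sets of the two tables; the addresses present only in ipamallocations are the ones that become unallocatable. A sketch with the data shown above:

```python
# IPs from the ipamallocations dump (6 rows) and the ipallocations dump (4 rows).
ipam_ips = {f"10.13.45.{i}" for i in range(1, 7)}
real_ips = {f"10.13.45.{i}" for i in range(1, 5)}

orphaned = ipam_ips - real_ips  # allocated in ipam but backed by no port
print(sorted(orphaned))  # ['10.13.45.5', '10.13.45.6']
```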
[Yahoo-eng-team] [Bug 1881455] [NEW] migrate server reporting list index out of bounds
Public bug reported: Description === When resize to the same host is enabled, a cold migration sometimes fails with 1. a "migrating to same host" error, and then 2. a list index out of bounds error. Steps to reproduce === Deploy two compute nodes and make the workload imbalanced, for example compute01 has more allocations than compute02. Then migrate a server on compute02. Expected result === Cold migration succeeds. Actual result === Sometimes it fails. Log ===

8084-4fa8-a3c4-2874555fb27c held by migration 0a8a29a5-7f9c-4af3-85a1-ea62ee5658c3 for instance
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [req-ee48014a-51c1-4e82-9ef3-e3b68a9a34e4 5f0b0ff35b914c84b24efb363965530d 0606e9bf4e9c4334b6cb9a5012c60fb8 - default default] [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Error: Unable to migrate instance ( 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02).: UnableToMigrateToSelf: Unable to migrate instance (8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02).
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Traceback (most recent call last):
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 4555, in prep_resize
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] node, migration, clean_shutdown)
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 4499, in _prep_resize
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] instance_id=instance.uuid, host=self.host)
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] UnableToMigrateToSelf: Unable to
migrate instance (8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02). 2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] 2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [req-ee48014a-51c1-4e82-9ef3-e3b68a9a34e4 5f0b0ff35b914c84b24efb363965530d 0606e9bf4e9c4334b6cb9a5012c60fb8 - default default] [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Error: Unable to migrate instance ( 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02).: UnableToMigrateToSelf: Unable to migrate instance (8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02). 2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Traceback (most recent call last): 2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 4555, in prep_resize 2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] node, migration, clean_shutdown) 2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 4499, in _prep_resize 2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] instance_id=instance.uuid, host=self.host) 2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] UnableToMigrateToSelf: Unable to migrate instance (8189fa53-3e8a-42e3-a735-1d91b9ff0c3b) to current host (compute02). 
2020-05-31 02:55:51.649 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] 2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager [req-ee48014a-51c1-4e82-9ef3-e3b68a9a34e4 5f0b0ff35b914c84b24efb363965530d 0606e9bf4e9c4334b6cb9a5012c60fb8 - default default] [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Setting instance vm_state to ERROR: IndexError: list index out of range 2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] Traceback (most recent call last): 2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 8333, in _error_out_instance_on_exception 2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] yield 2020-05-31 02:55:51.740 2419133 ERROR nova.compute.manager [instance: 8189fa53-3e8a-42e3-a735-1d91b9ff0c3b] File "/var/lib/openstack/lib/python2.7/site-packages/nova/compute/manager.py", line 4576,
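The IndexError at the end of the trace suggests a candidate list being indexed without an emptiness check after the current host was rejected. A defensive sketch (illustrative only, not nova's scheduler code):

```python
def pick_target_host(hosts, current_host, allow_same_host=False):
    """Choose a migration target, excluding the current host unless allowed."""
    candidates = [h for h in hosts if allow_same_host or h != current_host]
    if not candidates:
        # Raising a clear error here avoids the bare IndexError seen in the log.
        raise RuntimeError("no valid host found for migration")
    return candidates[0]

print(pick_target_host(["compute01", "compute02"], "compute02"))  # compute01
```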
[Yahoo-eng-team] [Bug 1880455] [NEW] interrupted vlan connection after live migration
Public bug reported: After this patch, https://github.com/openstack/neutron/commit/efa8dd08957b5b6b1a05f0ed412ff00462a9f216, I saw an unexpected vlan interruption after live migration. The steps to reproduce the problem are simple: first create two VMs, vm01 and vm02, on compute01 and compute02 respectively; then live migrate vm02 to compute01, and after it completes, live migrate vm02 back to compute02. After this you see that vm01 cannot access vm02, and ovs-appctl dpif/dump-flows br-int shows that flows from vm01 to vm02 are dropped. I am now suspecting the following code is never executed https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L685 because for nova ports, the port is removed before delete_port gets called. ** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1880455 Title: interrupted vlan connection after live migration Status in neutron: New Bug description: After this patch, https://github.com/openstack/neutron/commit/efa8dd08957b5b6b1a05f0ed412ff00462a9f216, I saw an unexpected vlan interruption after live migration. The steps to reproduce the problem are simple: first create two VMs, vm01 and vm02, on compute01 and compute02 respectively; then live migrate vm02 to compute01, and after it completes, live migrate vm02 back to compute02. After this you see that vm01 cannot access vm02, and ovs-appctl dpif/dump-flows br-int shows that flows from vm01 to vm02 are dropped. I am now suspecting the following code is never executed https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L685 because for nova ports, the port is removed before delete_port gets called.
To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1880455/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1870866] [NEW] inconsistent connection info data after live migration
Public bug reported: Description === After live migration, the block device mapping's connection info stays at "attaching", which is a confusing piece of information. The root cause seems to be the different code paths between live migration and volume attach. Steps to reproduce === Attach a volume and then live migrate to a different host. Expected result === Consistent information: either there should be no info, or the connection info should be preserved. ** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1870866 Title: inconsistent connection info data after live migration Status in OpenStack Compute (nova): New Bug description: Description === After live migration, the block device mapping's connection info stays at "attaching", which is a confusing piece of information. The root cause seems to be the different code paths between live migration and volume attach. Steps to reproduce === Attach a volume and then live migrate to a different host. Expected result === Consistent information: either there should be no info, or the connection info should be preserved. To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1870866/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1869808] [NEW] reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Public bug reported: We are using OpenStack Neutron 13.0.6 and it is deployed using OpenStack-Helm. I tested pinging servers in the same VLAN while restarting neutron-ovs-agent. The result shows:

root@mgt01:~# openstack server list
| ID                                   | Name     | Status | Networks             | Image               | Flavor  |
| 22d55077-b1b5-452e-8eba-cbcd2d1514a8 | test-1-1 | ACTIVE | vlan105=172.31.10.4  | Cirros 0.4.0 64-bit | m1.tiny |
| 726bc888-7767-44bc-b68a-7a1f3a6babf1 | test-1-2 | ACTIVE | vlan105=172.31.10.18 | Cirros 0.4.0 64-bit | m1.tiny |

$ ping 172.31.10.4
PING 172.31.10.4 (172.31.10.4): 56 data bytes
..
64 bytes from 172.31.10.4: seq=59 ttl=64 time=0.465 ms
64 bytes from 172.31.10.4: seq=60 ttl=64 time=0.510 ms
64 bytes from 172.31.10.4: seq=61 ttl=64 time=0.446 ms
64 bytes from 172.31.10.4: seq=63 ttl=64 time=0.744 ms   <-- seq 62 was lost here
64 bytes from 172.31.10.4: seq=64 ttl=64 time=0.477 ms
64 bytes from 172.31.10.4: seq=65 ttl=64 time=0.441 ms
64 bytes from 172.31.10.4: seq=66 ttl=64 time=0.376 ms
64 bytes from 172.31.10.4: seq=67 ttl=64 time=0.481 ms

As one can see, packet seq 62 was lost, I believe, while the ovs agent was restarting. Right now I suspect that this code, https://github.com/openstack/neutron/blob/6d619ea7c13e89ec575295f04c63ae316759c50a/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py#L229 , refreshes flow table rules even though it is not necessary. When I dump flows on the phys bridge, I can see the duration rewinding to 0, which suggests the flow has been deleted and created again:

"""
duration=secs
The time, in seconds, that the entry has been in the table. secs includes as much precision as the switch provides, possibly to nanosecond resolution.
"""

root@compute01:~# ovs-ofctl dump-flows br-floating
...
cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100, n_bytes=103409,   <-- this value resets
 priority=4,in_port="phy-br-floating",dl_vlan=2 actions=mod_vlan_vid:105,NORMAL
...

IMO, restarting the ovs agent should not affect the data plane.

** Affects: neutron
Importance: Undecided
Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1869808

Title: reboot neutron-ovs-agent introduces a short interrupt of vlan traffic
Status in neutron: New
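The "duration rewinds to 0" observation above can be checked systematically. Here is a sketch (the helper names are mine, not neutron code) that compares two `ovs-ofctl dump-flows` snapshots taken before and after an agent restart: a cookie whose duration decreased was deleted and re-installed, i.e. the data plane was touched.

```python
import re

# Sketch: detect flows that were deleted and re-created across an agent
# restart by comparing duration per cookie between two dump-flows snapshots.
FLOW_RE = re.compile(r"cookie=(0x[0-9a-f]+), duration=([\d.]+)s")

def durations_by_cookie(dump):
    """Map flow cookie -> duration in seconds for one dump-flows snapshot."""
    return {m.group(1): float(m.group(2)) for m in FLOW_RE.finditer(dump)}

def recreated_flows(before, after):
    """Cookies present in both snapshots whose duration went backwards."""
    b, a = durations_by_cookie(before), durations_by_cookie(after)
    return sorted(c for c in b.keys() & a.keys() if a[c] < b[c])

# Snapshot lines based on the report; the "after" duration is illustrative.
before = "cookie=0x673522f560f5ca4f, duration=323.852s, table=2, n_packets=1100"
after = "cookie=0x673522f560f5ca4f, duration=4.102s, table=2, n_packets=12"
print(recreated_flows(before, after))
```

An empty result after a restart would indicate the agent really did leave existing flows alone.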
[Yahoo-eng-team] [Bug 1866288] [NEW] tox pep8 fails on ubuntu 18.04.3
Public bug reported: pep8 checking fails for the rocky branch on Ubuntu 18.04.3:

root@mgt02:~/src/nova# tox -epep8 -vvv
removing /root/src/nova/.tox/log
using tox.ini: /root/src/nova/tox.ini
using tox-3.1.0 from /usr/local/lib/python2.7/dist-packages/tox/__init__.pyc
skipping sdist step
pep8 start: getenv /root/src/nova/.tox/shared
pep8 recreate: /root/src/nova/.tox/shared
ERROR: InterpreterNotFound: python3.5
pep8 finish: getenv after 0.00 seconds
__ summary ___
ERROR: pep8: InterpreterNotFound: python3.5

root@mgt02:~/src/nova# uname -a
Linux mgt02 4.15.0-88-generic #88-Ubuntu SMP Tue Feb 11 20:11:34 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1866288

Title: tox pep8 fails on ubuntu 18.04.3
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1866288/+subscriptions
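The failure is simply tox not finding the interpreter pinned by the rocky tox.ini (basepython = python3.5), which Ubuntu 18.04 does not ship by default. A quick sketch to see what the host actually provides, before deciding between installing python3.5 (e.g. from a third-party archive) or locally overriding basepython in tox.ini:

```python
import shutil

# Sketch: check which python3.x interpreters exist on this host. The version
# list is arbitrary; rocky's tox.ini expects python3.5.
available = {v: shutil.which("python" + v) is not None
             for v in ("3.5", "3.6", "3.7")}
print(available)
```

If python3.5 is missing, the options are to install it or to point the pep8 env at a locally available interpreter, accepting that results may then differ from the upstream gate.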
[Yahoo-eng-team] [Bug 1865120] [NEW] arm64 vm boot failed when set num_pcie_ports to 28
Public bug reported: We are testing OpenStack on Phytium FT2000PLUS:

root@compute01:~# lscpu
Architecture:        aarch64
Byte Order:          Little Endian
CPU(s):              64
On-line CPU(s) list: 0-63
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           16
NUMA node(s):        8
Model name:          Phytium,FT2000PLUS
CPU max MHz:         2200.
CPU min MHz:         1000.
BogoMIPS:            3600.00
NUMA node0 CPU(s):   0-7
NUMA node1 CPU(s):   8-15
NUMA node2 CPU(s):   16-23
NUMA node3 CPU(s):   24-31
NUMA node4 CPU(s):   32-39
NUMA node5 CPU(s):   40-47
NUMA node6 CPU(s):   48-55
NUMA node7 CPU(s):   56-63
Flags:               fp asimd evtstrm crc32

The problem we initially hit is that we were not able to attach more than 2 volumes (virtio-blk) when the config drive is enabled. We worked around the problem by using a SCSI bus instead, but we are still interested in making it possible to plug more than 2 virtio-blk devices. After some investigation I think `num_pcie_ports` might be too small (it looks like it defaults to 9 if unspecified); `pcie-root` does not allow hot plugging, and a `pcie-root-port` does not allow more than 1 slot, so the only way I can think of to mitigate the problem is to increase this option to its maximum. But the current problem is that VMs with previously working images fail to boot, and when I try virsh console I only see the UEFI shell.

Maybe this is not a bug in the code, but I definitely think it is necessary to improve the docs and make these terms easier to understand. I am glad to provide additional details if asked. Thanks.

** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1865120

Title: arm64 vm boot failed when set num_pcie_ports to 28
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1865120/+subscriptions
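For reference, the option under discussion lives in the [libvirt] section of nova.conf on the compute node. This is just a sketch of the configuration being tested, mirroring the report, not a recommendation; as I understand it, 28 is the upper bound nova accepts for this option.

```ini
# nova.conf on the compute node (sketch of the change being tested).
# Extra pcie-root-port controllers make more hotpluggable PCIe slots
# available for virtio devices on aarch64/q35-style machine types.
[libvirt]
num_pcie_ports = 28
```

Comparing `virsh dumpxml` of a working and a non-booting guest would show whether the additional pcie-root-port controllers are what trips up the firmware.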
[Yahoo-eng-team] [Bug 1856962] [NEW] openid method failed when federation_group_ids is empty list
Public bug reported: LOG (excerpt; the full traceback is in the bug description below):

  File "/var/lib/openstack/local/lib/python2.7/site-packages/keystone/auth/plugins/mapped.py", line 80, in handle_scoped_token
2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi     for group_dict in token.federated_groups:
2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi TypeError: 'NoneType' object is not iterable
2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi
10.16.4.45 - - [17/Dec/2019:02:25:09 +] "POST /v3/auth/tokens HTTP/1.1" 400 96 "-" "curl/7.58.0"

OpenStack version: Rocky

We are hitting this error message when using keystone federation. The mapping is as simple as:

[
  {
    "remote": [
      {"type": "REMOTE_USER"},
      {"type": "OIDC-project"}
    ],
    "local": [
      {"user": {"name": "{0}"}},
      {
        "projects": [
          {"name": "{1}", "roles": [{"name": "member"}]}
        ]
      }
    ]
  }
]

** Affects: keystone Importance: Undecided Assignee: norman shen (jshen28) Status: In Progress -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Identity (keystone).
https://bugs.launchpad.net/bugs/1856962 Title: openid method failed when federation_group_ids is empty list Status in OpenStack Identity (keystone): In Progress Bug description: LOG: 2019-12-17 02:25:09.269827 2019-12-17 02:25:09.269 10 INFO keystone.common.wsgi [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - -] POST http://keystone.openstack.svc.region-guiyang-zyy.myinspurcloud.com/v3/auth/tokens 2019-12-17 02:25:09.270180 2019-12-17 02:25:09.269 10 INFO keystone.common.wsgi [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - -] POST http://keystone.openstack.svc.region-guiyang-zyy.myinspurcloud.com/v3/auth/tokens 2019-12-17 02:25:09.298401 2019-12-17 02:25:09.297 10 WARNING keystone.common.fernet_utils [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - -] key_repository is world readable: /etc/keystone/fernet-keys/: NeedRegenerationException 2019-12-17 02:25:09.298764 2019-12-17 02:25:09.297 10 WARNING keystone.common.fernet_utils [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - -] key_repository is world readable: /etc/keystone/fernet-keys/: NeedRegenerationException 2019-12-17 02:25:09.344893 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi [req-521eb002-385e-4015-8035-16bfbdcf0d33 - - - - -] 'NoneType' object is not iterable: TypeError: 'NoneType' object is not iterable 2019-12-17 02:25:09.344916 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi Traceback (most recent call last): 2019-12-17 02:25:09.344921 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/keystone/common/wsgi.py", line 148, in __call__ 2019-12-17 02:25:09.344925 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi result = method(req, **params) 2019-12-17 02:25:09.344929 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/keystone/auth/controllers.py", line 67, in authenticate_for_token 2019-12-17 02:25:09.344934 2019-12-17 02:25:09.343 10 ERROR 
keystone.common.wsgi self.authenticate(request, auth_info, auth_context) 2019-12-17 02:25:09.344938 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/keystone/auth/controllers.py", line 236, in authenticate 2019-12-17 02:25:09.344942 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi auth_info.get_method_data(method_name)) 2019-12-17 02:25:09.344945 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/keystone/auth/plugins/mapped.py", line 58, in authenticate 2019-12-17 02:25:09.344949 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi PROVIDERS.identity_api) 2019-12-17 02:25:09.344953 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi File "/var/lib/openstack/local/lib/python2.7/site-packages/keystone/auth/plugins/mapped.py", line 80, in handle_scoped_token 2019-12-17 02:25:09.344957 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi for group_dict in token.federated_groups: 2019-12-17 02:25:09.344961 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi TypeError: 'NoneType' object is not iterable 2019-12-17 02:25:09.344965 2019-12-17 02:25:09.343 10 ERROR keystone.common.wsgi 20
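The traceback shows `handle_scoped_token` iterating `token.federated_groups` when that attribute is None (the mapping produced no groups). A minimal sketch of the failure and the usual defensive pattern; `Token` here is a stand-in for keystone's token object, not the real class.

```python
# Sketch: guard against federated_groups being None so an empty
# federated_group_ids does not raise "'NoneType' object is not iterable".
class Token:
    def __init__(self, federated_groups):
        # stand-in: keystone can leave this as None when no groups mapped
        self.federated_groups = federated_groups

def group_ids(token):
    # `or []` turns the None case into an empty iteration instead of a
    # TypeError, matching the crash in mapped.py line 80 above.
    return [g["id"] for g in (token.federated_groups or [])]

print(group_ids(Token(None)))             # []
print(group_ids(Token([{"id": "abc"}])))  # ['abc']
```

Whether the real fix belongs in the auth plugin or in token construction is a design choice for the actual patch; this only illustrates the None guard.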
[Yahoo-eng-team] [Bug 1856312] [NEW] RuntimeError during calling log_opts_values
Public bug reported: During startup of the nova-compute service, we hit the following error:

+ sed -i s/HOST_IP// /tmp/logging-nova-compute.conf
+ exec nova-compute --config-file /etc/nova/nova.conf --config-file /tmp/pod-shared/nova-console.conf --config-file /tmp/pod-shared/nova-libvirt.conf --config-file /tmp/pod-shared/nova-hypervisor.conf --log-config-append /tmp/logging-nova-compute.conf
2019-12-13 06:53:09.556 29036 WARNING oslo_config.cfg [-] Deprecated: Option "use_neutron" from group "DEFAULT" is deprecated for removal ( nova-network is deprecated, as are any related configuration options. ). Its value may be silently ignored in the future.
2019-12-13 06:53:12.000 29036 INFO nova.compute.rpcapi [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Automatically selected compute RPC version 5.0 from minimum service version 35
2019-12-13 06:53:12.000 29036 INFO nova.compute.rpcapi [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Automatically selected compute RPC version 5.0 from minimum service version 35
2019-12-13 06:53:12.029 29036 INFO nova.virt.driver [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Loading compute driver 'libvirt.LibvirtDriver'
2019-12-13 06:53:12.029 29036 INFO nova.virt.driver [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Loading compute driver 'libvirt.LibvirtDriver'
2019-12-13 06:53:22.064 29036 WARNING oslo_config.cfg [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option "firewall_driver" from group "DEFAULT" is deprecated for removal ( nova-network is deprecated, as are any related configuration options. ). Its value may be silently ignored in the future.
2019-12-13 06:53:22.192 29036 WARNING os_brick.initiator.connectors.remotefs [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Connection details not present. RemoteFsClient may not initialize properly.
2019-12-13 06:53:22.409 29036 WARNING oslo_config.cfg [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option "linuxnet_interface_driver" from group "DEFAULT" is deprecated for removal ( nova-network is deprecated, as are any related configuration options. ). Its value may be silently ignored in the future. 2019-12-13 06:53:22.414 29036 WARNING oslo_config.cfg [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option "metadata_port" from group "DEFAULT" is deprecated for removal ( nova-network is deprecated, as are any related configuration options. ). Its value may be silently ignored in the future. 2019-12-13 06:53:22.440 29036 INFO nova.service [-] Starting compute node (version 18.0.0) 2019-12-13 06:53:22.570 29036 WARNING oslo_config.cfg [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option "api_endpoint" from group "ironic" is deprecated for removal (Endpoint lookup uses the service catalog via common keystoneauth1 Adapter configuration options. In the current release, api_endpoint will override this behavior, but will be ignored and/or removed in a future release. To achieve the same result, use the endpoint_override option instead.). Its value may be silently ignored in the future. 2019-12-13 06:53:22.440 29036 INFO nova.service [-] Starting compute node (version 18.0.0) 2019-12-13 06:53:22.594 29036 WARNING oslo_config.cfg [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Deprecated: Option "api_endpoint" from group "ironic" is deprecated. Use option "endpoint-override" from group "ironic". 
2019-12-13 06:53:22.911 29036 CRITICAL nova [req-eec76cc3-35a9-4d1d-bb91-4c484f6ef855 - - - - -] Unhandled error: RuntimeError: dictionary changed size during iteration 2019-12-13 06:53:22.911 29036 ERROR nova Traceback (most recent call last): 2019-12-13 06:53:22.911 29036 ERROR nova File "/var/lib/openstack/bin/nova-compute", line 8, in 2019-12-13 06:53:22.911 29036 ERROR nova sys.exit(main()) 2019-12-13 06:53:22.911 29036 ERROR nova File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/inspur/cmd/compute.py", line 71, in main 2019-12-13 06:53:22.911 29036 ERROR nova service.wait() 2019-12-13 06:53:22.911 29036 ERROR nova File "/var/lib/openstack/local/lib/python2.7/site-packages/nova/service.py", line 460, in wait 2019-12-13 06:53:22.911 29036 ERROR nova _launcher.wait() 2019-12-13 06:53:22.911 29036 ERROR nova File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_service/service.py", line 392, in wait 2019-12-13 06:53:22.911 29036 ERROR nova status, signo = self._wait_for_exit_or_signal() 2019-12-13 06:53:22.911 29036 ERROR nova File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_service/service.py", line 367, in _wait_for_exit_or_signal 2019-12-13 06:53:22.911 29036 ERROR nova self.conf.log_opt_values(LOG, logging.DEBUG) 2019-12-13 06:53:22.911 29036 ERROR nova File "/var/lib/openstack/local/lib/python2.7/site-packages/oslo_config/cfg.py", line 2579, in log_opt_values 2019-12-13 06:53:22.911 29036 ERROR
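The crash originates in oslo.config's `log_opt_values`, which iterates the options dict while something else (plausibly another greenthread registering options) mutates it. A minimal reproduction of the underlying Python behavior, plus the usual fix of iterating over a snapshot; this illustrates the failure mode, not the actual oslo.config patch.

```python
# Reproduce "dictionary changed size during iteration" and show the
# snapshot-based fix.
d = {"a": 1, "b": 2}

def mutate_while_iterating(mapping):
    try:
        for key in mapping:
            mapping[key + "_copy"] = mapping[key]  # grows the dict mid-loop
        return "no error"
    except RuntimeError as exc:
        return str(exc)

err = mutate_while_iterating(dict(d))
print(err)  # dictionary changed size during iteration

# Fix: take a snapshot of the keys before iterating, so concurrent (or
# self-inflicted) insertions cannot invalidate the iterator.
m = dict(d)
for key in list(m):
    m[key + "_copy"] = m[key]
print(sorted(m))  # ['a', 'a_copy', 'b', 'b_copy']
```

In the real service the mutation comes from another thread, so the snapshot (or a lock around option registration) has to happen inside `log_opt_values` itself.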
[Yahoo-eng-team] [Bug 1840579] [NEW] excessive number of dvrs where vm got a fixed ip on floating network
Public bug reported: We are running into an unexpected situation where the number of DVR routers has increased to nearly 2000 on a compute node on which some instances got a NIC on the floating IP network. We are using the Queens release:

neutron-common/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
neutron-l3-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
neutron-metadata-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
neutron-openvswitch-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
python-neutron/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
python-neutron-fwaas/xenial,xenial,now 2:12.0.1-1.0~u16.04+mcp6 all [installed,automatic]
python-neutron-lib/xenial,xenial,now 1.13.0-1.0~u16.04+mcp9 all [installed,automatic]
python-neutronclient/xenial,xenial,now 1:6.7.0-1.0~u16.04+mcp17 all [installed,automatic]

Currently, my guess is that some application mistakenly invokes RPC calls like this, https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/api/rpc/agentnotifiers/l3_rpc_agent_api.py#L166 , with a DVR router associated with a floating IP address, on a host which has a fixed IP address allocated from the floating network (i.e. a port whose device_owner is prefixed with compute:). Such a router is then kept by this function, https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/db/l3_dvrscheduler_db.py#L427 , because `get_subnet_ids_on_router` does not filter out router:gateway ports. I think this is a bug because as long as we do not have ports with the specific device owners, we should not have a DVR router on the host.

Besides, it is pretty easy to replay this bug:

First create a DVR router with an external gateway on the floating network.
Then create a virtual machine with a fixed IP on the floating network.
Then call `routers_updated_on_host` manually; the DVR router will be created on the host where the VM resides, but actually it should not be there.
** Affects: neutron Importance: Undecided Assignee: norman shen (jshen28) Status: In Progress

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1840579

Title: excessive number of dvrs where vm got a fixed ip on floating network
Status in neutron: In Progress
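The filtering the report argues for can be sketched as follows. The function and constant names here are mine for illustration; neutron's real `get_subnet_ids_on_router` differs.

```python
# Sketch, not neutron code: when deciding whether a host still needs a DVR
# router, only count subnets reached through ports that genuinely need the
# router (e.g. router interfaces); subnets that only overlap via the
# router's external gateway port should be ignored.
DEVICE_OWNER_ROUTER_GW = "network:router_gateway"

def subnet_ids_on_router(router_ports, exclude_gateway=True):
    """Collect subnet IDs from a router's ports, optionally skipping the
    router:gateway port, which is what the report says is missing."""
    subnet_ids = set()
    for port in router_ports:
        if exclude_gateway and port["device_owner"] == DEVICE_OWNER_ROUTER_GW:
            continue
        for fixed_ip in port["fixed_ips"]:
            subnet_ids.add(fixed_ip["subnet_id"])
    return subnet_ids

# Fabricated router ports for illustration.
ports = [
    {"device_owner": "network:router_interface",
     "fixed_ips": [{"subnet_id": "internal-subnet"}]},
    {"device_owner": "network:router_gateway",
     "fixed_ips": [{"subnet_id": "floating-subnet"}]},
]
print(sorted(subnet_ids_on_router(ports)))  # ['internal-subnet']
```

With the gateway port excluded, a VM whose fixed IP happens to come from the floating network no longer makes its host look like it hosts the router.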
[Yahoo-eng-team] [Bug 1836680] [NEW] attach volume succeeded but device not found on guest machine
Public bug reported: Sorry, this bug was posted in the wrong place.

** Affects: neutron Importance: Undecided Status: Invalid ** Changed in: neutron Status: New => Invalid

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1836680

Title: attach volume succeeded but device not found on guest machine
Status in neutron: Invalid

To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1836680/+subscriptions
[Yahoo-eng-team] [Bug 1836681] [NEW] attach volume succeeded but device not found on guest machine
Public bug reported: We are using OpenStack Queens:

nova-common/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed]
nova-compute/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed,automatic]
nova-compute-kvm/xenial,now 2:17.0.9-6~u16.01+mcp189 all [installed]

The guest VM uses Windows 2012 Datacenter edition.

After successfully executing `openstack server add volume ${instance_id} ${volume_id}`, we observe that the volume status changed to in-use and the attachment info is correctly stored in both nova and cinder, but the device does not show up in the guest machine.

We executed `virsh dumpxml ${instance_id}` but the device is not there. We then tried to edit it directly with `virsh edit ${instance_id}`, and there we do see the device with the proper attachment info...

In the end we had to shut down the VM and boot it again to solve the problem.

Command line outputs are put below (the domain XML itself was stripped by the mailing list archive):

/var/lib/libvirt/qemu# virsh dumpxml 55 --inactive
(XML elided)

# virsh domblklist 55
Target Source
vda    vms/xxx
vdb    vms/

Manually attaching vdc reports `vdc` in use.

** Affects: nova Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1836681

Title: attach volume succeeded but device not found on guest machine
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1836681/+subscriptions
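The discrepancy described above (device visible via `virsh edit` but absent from `virsh dumpxml`) is a live-vs-persistent config mismatch, and it can be checked by diffing the two XML documents. A sketch using the standard library; the XML snippets below are fabricated for illustration.

```python
import xml.etree.ElementTree as ET

# Sketch: compare disk targets between two libvirt domain XML documents,
# e.g. `virsh dumpxml 55` (live) vs `virsh dumpxml 55 --inactive`
# (persistent config).
def disk_targets(domain_xml):
    """Return the set of disk target dev names in a domain XML document."""
    root = ET.fromstring(domain_xml)
    return {d.get("dev") for d in root.findall("./devices/disk/target")}

live = """<domain><devices>
  <disk type='network'><target dev='vda' bus='virtio'/></disk>
  <disk type='network'><target dev='vdb' bus='virtio'/></disk>
</devices></domain>"""

persistent = """<domain><devices>
  <disk type='network'><target dev='vda' bus='virtio'/></disk>
  <disk type='network'><target dev='vdb' bus='virtio'/></disk>
  <disk type='network'><target dev='vdc' bus='virtio'/></disk>
</devices></domain>"""

missing = sorted(disk_targets(persistent) - disk_targets(live))
print(missing)  # ['vdc'] -- attached in config but absent from the live domain
```

A non-empty `missing` set matches the bug's symptom: the attach landed only in the persistent definition, so it takes effect only after a full stop/start.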
[Yahoo-eng-team] [Bug 1830456] [NEW] dvr router slow response during port update
Public bug reported: We have a distributed router used by hundreds of virtual machines scattered across around 150 compute nodes. When nova sends a port update request to neutron, it generally takes nearly 4 minutes to complete. The Neutron version is OpenStack Queens 12.0.5. I found the following log entry printed by neutron-server:

2019-05-25 05:24:16,285.285 11834 INFO neutron.wsgi [req- x - default default] x.x.x.x "PUT /v2.0/ports/8c252d91-741a-4627-9600-916d1da5178f HTTP/1.1" status: 200 len: 0 time: 233.6103470

You can see it takes around 240 seconds to finish the request. Right now I suspect this code snippet, https://github.com/openstack/neutron/blob/de59a21754747335d0d9d26082c7f0df105a30c9/neutron/db/l3_dvrscheduler_db.py#L139 , leads to the issue.

** Affects: neutron Importance: Undecided Status: New -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to neutron. https://bugs.launchpad.net/bugs/1830456

Title: dvr router slow response during port update
Status in neutron: New
To manage notifications about this bug go to: https://bugs.launchpad.net/neutron/+bug/1830456/+subscriptions
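To quantify how widespread the slow port updates are, the elapsed time can be pulled straight out of neutron.wsgi access-log lines like the one quoted above. A small triage sketch (the regex is tailored to the quoted line format and may need adjusting for other deployments):

```python
import re

# Sketch: extract method, path and elapsed seconds from neutron.wsgi
# access-log lines and flag slow requests.
LOG_RE = re.compile(r'"(?P<method>\w+) (?P<path>\S+) HTTP/[\d.]+" '
                    r'status: \d+ +len: \d+ +time: (?P<secs>[\d.]+)')

def slow_requests(lines, threshold=10.0):
    """Return (method, path, seconds) for lines slower than threshold."""
    slow = []
    for line in lines:
        m = LOG_RE.search(line)
        if m and float(m.group("secs")) > threshold:
            slow.append((m.group("method"), m.group("path"),
                         float(m.group("secs"))))
    return slow

# The log line quoted in the report.
line = ('2019-05-25 05:24:16,285.285 11834 INFO neutron.wsgi [req- x - default'
        ' default] x.x.x.x "PUT /v2.0/ports/8c252d91-741a-4627-9600-916d1da5178f'
        ' HTTP/1.1" status: 200 len: 0 time: 233.6103470')
print(slow_requests([line]))
```

Running this over a day of neutron-server logs would show whether only port updates on this one DVR router are slow, which would support the l3_dvrscheduler_db suspicion.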