[Yahoo-eng-team] [Bug 2007635] Re: ask for large-scale deployment help
This is not a bug, so I'm closing it. You can find more information about large deployments in the Large Scale SIG: https://docs.openstack.org/large-scale/journey/index.html

** Changed in: nova
   Status: New => Invalid

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2007635

Title: ask for large-scale deployment help
Status in OpenStack Compute (nova): Invalid

Bug description:
Hello all, this is not a bug; I am asking for help. I am new to OpenStack, and I am looking for official information about large-scale OpenStack deployments and stress-test indicator data. For example, how many compute nodes and VMs can a single OpenStack region (without Nova cells and with only three controller nodes) have at most? How many VMs can be created/stopped/started/migrated at the same time? Is there any official data about this, or anywhere I can find this information? Thank you for your help!

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/2007635/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
[Yahoo-eng-team] [Bug 1951617] [NEW] "Quota exceeded" message is confusing for "resize"
Public bug reported:

The "Quota exceeded" message is confusing for "resize".

When trying to create an instance and there is no quota available, the user gets an error message, for example:

"Quota exceeded for cores: Requested 1, but already used 100 of 100 cores (HTTP 403)"

The user can see that the project is already using 100 vCPUs out of the 100 vCPUs available (vCPU quota) in the project. However, when trying to resize an instance the user can get a similar error message:

"Quota exceeded for cores: Requested 2, but already used 42 of 100 cores (HTTP 403)"

This has a completely different meaning! It means that the user (the owner of the instance being resized) is using 42 vCPUs in the project, out of the 100 cores allowed by the quota. This is hard to understand for an end user. Read naively, the message suggests the project still has plenty of resources for the resize.

I believe this comes from the time when Nova allowed quotas per user. In my opinion this distinction shouldn't be made anymore; as mentioned, we don't make it when creating a new instance.

+++ This was tested with the master branch (19/11/2021)

** Affects: nova
   Importance: Undecided
   Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
   Status: New

** Changed in: nova
   Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

https://bugs.launchpad.net/bugs/1951617
Title: "Quota exceeded" message is confusing for "resize"
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1951617/+subscriptions
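A minimal sketch of the confusion described above (hypothetical function names, not Nova's actual quota code): the same message template reads completely differently depending on whether "used" counts project usage or per-user usage, and a resize message reporting project-level headroom would be unambiguous.

```python
# Hypothetical illustration of the two messages; not Nova's actual code.

def quota_exceeded_msg(resource, requested, used, limit):
    # The template Nova uses for both create and resize failures.
    return ("Quota exceeded for %s: Requested %d, but already used %d "
            "of %d %s (HTTP 403)" % (resource, requested, used, limit, resource))

def clearer_resize_msg(resource, requested, project_used, limit):
    # A possible alternative: always report project-level headroom,
    # as the instance-creation path effectively does.
    headroom = limit - project_used
    return ("Quota exceeded for %s: Requested %d, but the project has only "
            "%d of %d %s free (HTTP 403)"
            % (resource, requested, headroom, limit, resource))

create_msg = quota_exceeded_msg("cores", 1, 100, 100)  # "used" = project usage
resize_msg = quota_exceeded_msg("cores", 2, 42, 100)   # "used" = per-user usage
better_msg = clearer_resize_msg("cores", 2, 100, 100)  # project has 0 free
```

The two `quota_exceeded_msg` outputs look identical in shape, yet the first counts the whole project and the second only one user, which is exactly the ambiguity reported here.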
[Yahoo-eng-team] [Bug 1947753] [NEW] Evacuated instances are not removed from the source
Public bug reported:

Instance "evacuation" is a great feature and we are trying to take advantage of it. But it has some limitations, depending on how "broken" the node is. Let me give some context...

In the scenario where the compute node loses connectivity (broken switch port, loose network cable, ...) or nova-compute is stuck (filesystem issue), evacuating instances can have some unexpected consequences and lead to data corruption in the application (for example in a DB application).

If a compute node loses connectivity (or an entire set of compute nodes does), nova-compute and the instances are "not available". If the node runs critical applications (let's suppose a MySQL DB), the cloud operator could be tempted to "evacuate" the instance to recover the critical application for the user. At this point the cloud operator may not yet know the cause of the compute node issue, and maybe it won't be possible to shut the node down (management network affected?, ...), or the operator simply doesn't want to interfere with the work of the repair team.

The repair team fixes the issue (it can take a few minutes or hours...) and nova-compute and the instances are available again. The problem is that nova-compute doesn't destroy the evacuated instances on the source.

```
2021-10-19 11:17:51.519 3050 WARNING nova.compute.resource_tracker [req-0ed10e35-2715-466a-918b-69eb1fc770e8 - - - - -] Instance fc3be091-56d3-4c69-8adb-2fdb8b0a35d2 has been moved to another host foo.cern.ch(foo.cern.ch). There are allocations remaining against the source host that might need to be removed: {u'resources': {u'VCPU': 1, u'MEMORY_MB': 1875}}.
```

At this point we have 2 instances sharing the same IP and possibly writing into the same volume. Only when nova-compute is restarted (I guess that was always the assumption... the compute node was really broken) are the evacuated instances on the affected node removed.

```
2021-10-19 15:39:49.257 21189 INFO nova.compute.manager [req-ded45b0c-20ab-4587-9533-8c613d977f79 - - - - -] Destroying instance as it has been evacuated from this host but still exists in the hypervisor
2021-10-19 15:39:52.949 21189 INFO nova.virt.libvirt.driver [ ] Instance destroyed successfully.
```

I would expect nova-compute to check constantly for evacuated instances and then remove them. Otherwise, this requires a lot of coordination between different support teams. Should this be moved to a periodic task? https://github.com/openstack/nova/blob/e14eef0719eceef35e7e96b3e3d242ec79a80969/nova/compute/manager.py#L1440

I'm running Stein, but looking into the code, we have the same behaviour in master.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1947753
Title: Evacuated instances are not removed from the source
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1947753/+subscriptions
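A minimal sketch of the periodic check being proposed (simplified, hypothetical data structures; not Nova's actual manager code): on each periodic-task tick, any instance still running locally that has an evacuation migration record naming this host as the source should be destroyed.

```python
# Hypothetical sketch of the proposed periodic task; not nova-compute code.

def evacuated_instances_to_destroy(host, hypervisor_instances, migrations):
    """Return instance UUIDs that were evacuated away and must be destroyed.

    hypervisor_instances: UUIDs currently running on this hypervisor.
    migrations: records like {"uuid", "type", "source", "dest"}.
    """
    evacuated_away = {
        m["uuid"] for m in migrations
        if m["type"] == "evacuation" and m["source"] == host
    }
    # Destroy only instances that actually still exist locally.
    return {uuid for uuid in hypervisor_instances if uuid in evacuated_away}

migrations = [
    {"uuid": "fc3be091", "type": "evacuation",
     "source": "broken.cern.ch", "dest": "foo.cern.ch"},
]
# The "broken" node came back with the evacuated instance still running:
to_destroy = evacuated_instances_to_destroy(
    "broken.cern.ch", {"fc3be091", "aaaa1111"}, migrations)
```

Running such a check on every periodic cycle, instead of only at nova-compute startup, would close the window in which two copies of the instance share the same IP and volume.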
[Yahoo-eng-team] [Bug 1933955] [NEW] Power sync using the Ironic driver queries all the nodes from Ironic when using Conductor Groups
Public bug reported:

"""
While synchronizing instance power states, found 447 instances in the database and 8712 instances on the hypervisor.
"""

This is the warning message that we get when using conductor groups during a power sync. Conductor groups make it possible to have dedicated nova-compute nodes manage a set of Ironic nodes. However, "_sync_power_states" doesn't deal with this correctly.

First, this function gets all the nodes from the DB that are managed by the Nova compute node. Then it asks the "driver" to get all the instances. When using the Ironic driver, it returns all the nodes in Ironic! (When there are thousands of nodes, Ironic can also take several minutes to return, but that is a different bug.) Of course, the comparison then fails, producing the warning message above.

There are different possibilities...
- We can change the Ironic driver to return only the nodes from the conductor group that this Nova compute node belongs to. However, this is not good enough if the conductor group is managed by more than 1 Nova compute node. Ironic doesn't know which Nova compute node manages each node!
- We agree that this check doesn't bring a lot of value when using the Ironic driver, and we simply skip it if the Ironic driver is used.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1933955
Title: Power sync using the Ironic driver queries all the nodes from Ironic when using Conductor Groups
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1933955/+subscriptions
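A minimal sketch of the mismatch and of the first proposed option (hypothetical data and function names; not the actual `_sync_power_states` code): the DB query is scoped to one compute service, but the driver reports every Ironic node, so the counts can never agree unless the driver filters by conductor group first.

```python
# Hypothetical illustration of the power-sync count check; not Nova code.

def power_sync_warning(db_instances, driver_instances):
    # The sync bails out with a warning when the two views disagree.
    if len(db_instances) != len(driver_instances):
        return ("While synchronizing instance power states, found %d "
                "instances in the database and %d instances on the "
                "hypervisor." % (len(db_instances), len(driver_instances)))
    return None

def filter_by_conductor_group(ironic_nodes, group):
    # Option 1 from the report: scope the driver's answer to one group.
    return [n for n in ironic_nodes if n["conductor_group"] == group]

all_nodes = ([{"id": i, "conductor_group": "cg1"} for i in range(447)]
             + [{"id": 1000 + i, "conductor_group": "cg2"} for i in range(8265)])
db_instances = list(range(447))  # what the DB returns for this compute node

warn = power_sync_warning(db_instances, all_nodes)  # fires: 447 vs 8712
fixed = power_sync_warning(
    db_instances, filter_by_conductor_group(all_nodes, "cg1"))  # no warning
```

As the report notes, this filtering still breaks when one conductor group is shared by several nova-compute services, which is why skipping the check for the Ironic driver is the other option.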
[Yahoo-eng-team] [Bug 1927740] [NEW] Ironic driver persistent warn msg when running only a node per conductor group
Public bug reported:

```
2021-05-07 13:55:12.570 3142 WARNING nova.virt.ironic.driver [req-bcca8fbe-3293-4d85-a3a3-a07328d91c17 - - - - -] This compute service (XXX) is the only service present in the [ironic]/peer_list option. Are you sure this should not include more hosts?
```

The decision about the number of compute nodes behind each conductor group depends on the deployment architecture and risk tolerance. Deployments that decided to run only one compute node per conductor group get the above message in the logs on every periodic-task cycle. It's good that Nova points out that this can be an issue, but the frequency really "pollutes" the logs for all the operators who made a conscious decision.

I propose moving the log level from warn to debug. In debug, operators will continue to see this message, and operators usually run in debug mode when debugging issues or during the deployment phase.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1927740
Title: Ironic driver persistent warn msg when running only a node per conductor group
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1927740/+subscriptions
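The effect of the proposed change can be sketched with the stdlib `logging` module (a simplified stand-in for oslo.log): at the usual WARNING threshold the message disappears from the logs, while operators who enable debug logging still see it.

```python
# Sketch of warn-vs-debug visibility using stdlib logging (not oslo.log).
import io
import logging

stream = io.StringIO()
logging.basicConfig(stream=stream, level=logging.WARNING, force=True)
log = logging.getLogger("nova.virt.ironic.driver")

msg = ("This compute service (XXX) is the only service present in the "
       "[ironic]/peer_list option. Are you sure this should not include "
       "more hosts?")

log.warning(msg)  # current behaviour: emitted on every periodic-task cycle
log.debug(msg)    # proposed behaviour: suppressed at the default level

output = stream.getvalue()
```

At `level=logging.DEBUG` both calls would be emitted, so the hint is preserved exactly for the debugging and deployment scenarios mentioned above.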
[Yahoo-eng-team] [Bug 1924612] [NEW] Can't list "killed" images using the CLI
Public bug reported:

Doing a DB clean-up I noticed that we have several images in the "killed" state, but using the CLI I wasn't able to list them. However, when the image_id is known, the details can be shown and the image can be deleted.

If a user can't list "killed" images, they don't know that those images belong to their project, so the images can't be deleted. This is mostly "cosmetics", but it would be good to clean them up.

Talking with abhishekk on IRC, he suggested trying: "glance image-list --property-filter status=killed". It doesn't work in the Ussuri release.

** Affects: glance
   Importance: Undecided
   Status: New

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to Glance.
https://bugs.launchpad.net/bugs/1924612
Title: Can't list "killed" images using the CLI
Status in Glance: New

To manage notifications about this bug go to: https://bugs.launchpad.net/glance/+bug/1924612/+subscriptions
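For clarity, a tiny sketch of what the suggested `--property-filter status=killed` is expected to do on the image list (hypothetical image records; not glanceclient code): select only the records whose `status` field is `killed`, which is exactly the filtering that reportedly does not work in Ussuri.

```python
# Hypothetical image records; illustrates the intended status filter only.

def killed_images(images):
    """Return the IDs of images stuck in the 'killed' state."""
    return [img["id"] for img in images if img["status"] == "killed"]

images = [
    {"id": "img-1", "status": "active"},
    {"id": "img-2", "status": "killed"},
    {"id": "img-3", "status": "queued"},
]
stuck = killed_images(images)
```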
[Yahoo-eng-team] [Bug 1924585] [NEW] Live Migration - if libvirt timeout the instance goes to error state but the live migration continues
Public bug reported:

Recently we live migrated an entire cell to new hardware and hit the following problem several times...

During a live migration, Nova monitors the state of the migration by querying libvirt every 0.5s: https://github.com/openstack/nova/blob/5eab13030bc2708c8900f7ac1bdbc8a111f5f823/nova/virt/libvirt/driver.py#L9452

If libvirt times out, the instance is left in a very bad state... The instance goes to error state, and for Nova the instance remains on the source compute node. However, libvirt continues with the live migration, which will eventually end up on the destination compute node.

I'm using the Stein release, but looking into the current release the code path seems the same. Here's the Stein trace:

```
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6796, in _do_live_migration
    block_migration, migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7581, in live_migration
    migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8068, in _live_migration
    finish_event, disk_paths)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7873, in _live_migration_monitor
    info = guest.get_job_info()
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 705, in get_job_info
    stats = self._domain.jobStats()
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1433, in jobStats
    if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMemoryStats)
```

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1924585
Title: Live Migration - if libvirt timeout the instance goes to error state but the live migration continues
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1924585/+subscriptions
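One possible mitigation, sketched below with hypothetical helpers (not the actual driver code), is to treat a transient `jobStats` lock timeout as retryable inside the monitor loop instead of letting the exception abort monitoring while libvirt carries on migrating:

```python
# Hypothetical retry wrapper around the monitor's job-status query.
import time

class TransientLibvirtError(Exception):
    """Stand-in for libvirtError('cannot acquire state change lock ...')."""

def get_job_info_with_retry(fetch, retries=3, delay=0.0):
    """fetch() stands in for guest.get_job_info() in the monitor loop."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except TransientLibvirtError:
            if attempt == retries:
                raise  # persistent failure: give up as today
            time.sleep(delay)  # in Nova this would be the 0.5s poll interval

calls = {"n": 0}

def flaky_fetch():
    # Simulates libvirtd failing twice with a lock timeout, then recovering.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientLibvirtError("cannot acquire state change lock")
    return {"type": "running"}

info = get_job_info_with_retry(flaky_fetch)
```

With a bounded retry like this, a momentarily stuck libvirtd would not put the instance into error state while the migration itself keeps progressing toward the destination.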
[Yahoo-eng-team] [Bug 1924123] [NEW] If source compute node is overcommitted instances can't be migrated
Public bug reported:

I'm facing an issue similar to https://bugs.launchpad.net/nova/+bug/1918419 but somewhat different, which is why I'm opening a new bug. I'm giving some context to better explain how this affects operations. Here's the story...

When a compute node needs a hardware intervention, we have an automated process that the repair team uses (they don't have access to the OpenStack APIs) to live migrate all the instances before starting the repair. The motivation is to minimize the impact on users. However, instances can't be live migrated if the compute node becomes overcommitted! It happens that if a DIMM fails in a compute node that has all its memory allocated to VMs, it's not possible to move those VMs:

"No valid host was found. Unable to replace instance claim on source (HTTP 400)"

The compute node becomes overcommitted (because the DIMM is not visible anymore) and placement can't create the migration allocation on the source. The operator can work around this and "tune" the memory overcommit for the affected compute node, but that requires investigation and a manual intervention by an operator, defeating automation and delegation to other teams. This is extremely complicated in large deployments.

I don't believe this behaviour is correct. If there are available resources to host the instances on a different compute node, placement shouldn't block the live migration because the source is overcommitted.

+++ Using Nova Stein. From what I checked, it looks like this is still the behaviour in recent releases.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1924123
Title: If source compute node is overcommitted instances can't be migrated
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1924123/+subscriptions
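The failure mode can be sketched with a toy capacity check (hypothetical numbers and function; not placement's actual implementation): replacing the instance claim with a migration allocation re-validates capacity on the *source*, whose total has shrunk below what is already allocated, even though the destination has plenty of room.

```python
# Toy model of a capacity check; numbers and names are illustrative only.

def can_allocate(total, allocation_ratio, used, requested):
    """Placement-style check: does the request fit within capacity?"""
    return used + requested <= total * allocation_ratio

TOTAL_MB_BEFORE = 256 * 1024  # host memory fully allocated to VMs
TOTAL_MB_AFTER = 192 * 1024   # a 64 GB DIMM failed and disappeared
USED_MB = 256 * 1024          # existing VM allocations, unchanged

# Creating the migration allocation re-checks the shrunken source...
source_ok = can_allocate(TOTAL_MB_AFTER, 1.0, 0, USED_MB)
# ...even though the destination could host everything:
dest_ok = can_allocate(512 * 1024, 1.0, 0, USED_MB)
```

This illustrates the report's point: the blocking check is on the node being evacuated, which by definition is the one that lost resources.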
[Yahoo-eng-team] [Bug 1918419] [NEW] vCPU resource max_unit is hardcoded
Public bug reported:

Because of the Spectre/Meltdown vulnerabilities (2018) we needed to disable SMT on all public-facing compute nodes. As a result, the number of available cores was reduced by half. We had flavors with 32 vCPUs that couldn't be used anymore, because placement's max_unit for vCPUs is hardcoded to the total number of CPUs, regardless of the allocation_ratio. To me it's a sensible default, but it doesn't offer any flexibility for operators. See the IRC discussion at that time: http://eavesdrop.openstack.org/irclogs/%23openstack-placement/%23openstack-placement.2018-09-20.log.html

In conclusion, we informed the users that we couldn't offer those flavors anymore. The old VMs (created before disabling SMT) continued to run without any issue.

So... after ~2 years I'm hitting this problem again :) These compute nodes now need to be retired and we are live migrating all the instances to the replacement hardware. When trying to live migrate these instances (vCPUs > max_unit) it fails, because the migration allocation can't be created against the source compute node. For the new hardware (dest_compute) vCPUs < max_unit, so there is no issue with the new allocation.

I'm working around this problem (to live migrate the instances) by patching the code to allow a higher max_unit for vCPUs on the compute nodes hosting these instances. I feel that this issue should be discussed again, considering the possibility of making the max_unit value configurable.

** Affects: nova
   Importance: Undecided
   Status: New

https://bugs.launchpad.net/bugs/1918419
Title: vCPU resource max_unit is hardcoded
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1918419/+subscriptions
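A minimal sketch of the hardcoded behaviour (simplified, hypothetical helpers; not the resource tracker itself): max_unit caps the size of a *single* allocation at the host's physical CPU count, so after SMT halves that count, a big flavor fails even though the allocation_ratio leaves plenty of aggregate capacity.

```python
# Toy model of VCPU inventory; the 48/24 CPU counts are illustrative.

def vcpu_inventory(total_cpus, allocation_ratio):
    return {
        "total": total_cpus,
        "max_unit": total_cpus,  # hardcoded to the physical total
        "capacity": int(total_cpus * allocation_ratio),
    }

def allocation_allowed(inventory, requested_vcpus):
    # A single allocation may never exceed max_unit, whatever the capacity.
    return requested_vcpus <= inventory["max_unit"]

with_smt = vcpu_inventory(48, 2.0)      # before: 48 threads visible
without_smt = vcpu_inventory(24, 2.0)   # SMT disabled: half the CPUs

flavor_before = allocation_allowed(with_smt, 32)     # 32-vCPU flavor fits
flavor_after = allocation_allowed(without_smt, 32)   # rejected by max_unit
```

Note that `without_smt["capacity"]` is still 48 vCPUs in aggregate; only the per-allocation cap shrank, which is why a configurable max_unit would unblock both new boots and the migration allocations described above.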
[Yahoo-eng-team] [Bug 1917645] [NEW] Nova can't create instances if RabbitMQ notification cluster is down
Public bug reported: We use independent RabbitMQ clusters for each OpenStack project, for Nova cells, and also for notifications. Recently I noticed in our test infrastructure that if the RabbitMQ cluster for notifications has an outage, Nova can't create new instances. Possibly other operations will also hang.

Not being able to send a notification or connect to the notification RabbitMQ cluster shouldn't stop new instances from being created. (If blocking is actually a use case for some deployments, the operator should have the possibility to configure it.)

Tested against the master branch. If the notification RabbitMQ is stopped, nova-scheduler gets stuck during instance creation with:

```
Mar 01 21:16:28 devstack nova-scheduler[18384]: DEBUG nova.scheduler.request_filter [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Request filter 'accelerators_filter' took 0.0 seconds {{(pid=18384) wrapper /opt/stack/nova/nova/scheduler/request_filter.py:46}}
Mar 01 21:16:32 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 2.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:16:35 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 4.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:16:42 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 6.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:16:51 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 8.0 seconds): OSError: [Errno 113] EHOSTUNREACH
Mar 01 21:17:02 devstack nova-scheduler[18384]: ERROR oslo.messaging._drivers.impl_rabbit [None req-353318d1-f4bd-499d-98db-a0919d28ecf7 demo demo] Connection failed: [Errno 113] EHOSTUNREACH (retrying in 10.0 seconds): OSError: [Errno 113] EHOSTUNREACH
(...)
```

Because the notification RabbitMQ cluster is down, Nova gets stuck in:
https://github.com/openstack/nova/blob/5b66caab870558b8a7f7b662c01587b959ad3d41/nova/scheduler/filter_scheduler.py#L85
because oslo messaging never gives up:
https://github.com/openstack/oslo.messaging/blob/5aa645b38b4c1cf08b00e687eb6c7c4b8a0211fc/oslo_messaging/_drivers/impl_rabbit.py#L736

** Affects: nova Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1917645

Title: Nova can't create instances if RabbitMQ notification cluster is down
Status in OpenStack Compute (nova): New
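One mitigation sketch on the oslo.messaging side: the `[oslo_messaging_notifications]` section supports a `retry` option (maximum delivery attempts before a notification is dropped; -1, the default, retries forever). Bounding it would let instance creation proceed through a notification-cluster outage, at the cost of losing notifications. The host name and value below are illustrative, not a recommendation:

```
[oslo_messaging_notifications]
driver = messagingv2
# Hypothetical dedicated notification cluster
transport_url = rabbit://nova:SECRET@rabbit-notifications:5672/
# Default is -1 (retry forever), which is what blocks the scheduler
# here; a small bound drops notifications instead of hanging.
retry = 3
```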
[Yahoo-eng-team] [Bug 1916031] [NEW] Wrong elapsed time logged during a live migration
Public bug reported: In a recent VM live migration I noticed that the migration time reported in the logs was not consistent with the actual time it was taking:

```
2021-01-15 09:51:07.41 43553 INFO nova.virt.libvirt.driver [ ] Migration running for 0 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0)
2021-01-15 09:52:37.740 43553 DEBUG nova.virt.libvirt.driver [ ] Migration running for 5 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0)
2021-01-15 09:53:34.574 43553 DEBUG nova.virt.libvirt.driver [ ] Migration running for 10 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0)
2021-01-15 09:54:21.186 43553 DEBUG nova.virt.libvirt.driver [ ] Migration running for 15 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0)
(...)
```

This is because Nova doesn't log the actual time the migration is taking. It polls the migration job status every 500 ms and logs the number of cycles divided by 2. Nova assumes that the libvirt calls return immediately, which was not the case. (In this particular example the compute node had issues and libvirt calls were taking a few seconds.)

This behavior can cause some confusion when operators are debugging issues. In my opinion Nova should log the real migration time.
** Affects: nova Importance: Undecided Assignee: Belmiro Moreira (moreira-belmiro-email-lists) Status: New

** Changed in: nova Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)
-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1916031

Title: Wrong elapsed time logged during a live migration
Status in OpenStack Compute (nova): New
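The problem can be illustrated with a small sketch (hypothetical helper names, not the actual nova code): counting poll iterations under-reports whenever the status call itself is slow, while a monotonic clock reports the real elapsed time.

```python
import time

def poll_migration(job_done, interval=0.5):
    """Poll a migration job; compare cycle-counting with a real clock.

    ``job_done`` stands in for the libvirt job-status call, which the
    report above shows can itself take seconds on a loaded host.
    Returns (time Nova would log, real elapsed time).
    """
    start = time.monotonic()
    cycles = 0
    while not job_done():
        cycles += 1
        time.sleep(interval)
    # What Nova logs (cycle count * interval) vs. wall-clock reality.
    assumed_secs = cycles * interval
    real_secs = time.monotonic() - start
    return assumed_secs, real_secs
```

With a status call that sleeps 0.2 s per invocation and a 0.1 s interval, the real elapsed time comes out well above the assumed one, mirroring the log excerpt above where "5 secs" actually spanned about a minute and a half.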
[Yahoo-eng-team] [Bug 1902216] [NEW] Can't define a cpu_model from a different architecture
Public bug reported: """ It would be great if Nova supports instances with a different architecture than the host. My use case is to be run aarch64 guests in a x86_64 compute node. """ In order to create an aarch64 guest in an x86_64 compute node we need to define the emulated CPU. However, Nova doesn't allow to define a CPU model that doesn't match with the host architecture. For example: CONF.libvirt.virt_type=qemu CONF.libvirt.cpu_model=cortex-a57 CONF.libvirt.cpu_mode=custom It fails with: nova.exception.InvalidCPUInfo: Configured CPU model: cortex-a57 is not correct, or your host CPU arch does not support this model. Please correct your config and try again. The problem is related with the this nova check in driver.py: if cpu_info['arch'] not in (fields.Architecture.I686, fields.Architecture.X86_64, fields.Architecture.PPC64, fields.Architecture.PPC64LE, fields.Architecture.PPC): return model Again, it's relying the host architecture for the x86_64. Environment === Tested using the master branch (29/10/2020) Other = I'm now opening target bugs for the generic issue reported in https://bugs.launchpad.net/nova/+bug/1863728 ** Affects: nova Importance: Undecided Assignee: Belmiro Moreira (moreira-belmiro-email-lists) Status: New ** Changed in: nova Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1902216 Title: Can't define a cpu_model from a different architecture Status in OpenStack Compute (nova): New Bug description: """ It would be great if Nova supports instances with a different architecture than the host. My use case is to be run aarch64 guests in a x86_64 compute node. """ In order to create an aarch64 guest in an x86_64 compute node we need to define the emulated CPU. 
However, Nova doesn't allow to define a CPU model that doesn't match with the host architecture. For example: CONF.libvirt.virt_type=qemu CONF.libvirt.cpu_model=cortex-a57 CONF.libvirt.cpu_mode=custom It fails with: nova.exception.InvalidCPUInfo: Configured CPU model: cortex-a57 is not correct, or your host CPU arch does not support this model. Please correct your config and try again. The problem is related with the this nova check in driver.py: if cpu_info['arch'] not in (fields.Architecture.I686, fields.Architecture.X86_64, fields.Architecture.PPC64, fields.Architecture.PPC64LE, fields.Architecture.PPC): return model Again, it's relying the host architecture for the x86_64. Environment === Tested using the master branch (29/10/2020) Other = I'm now opening target bugs for the generic issue reported in https://bugs.launchpad.net/nova/+bug/1863728 To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1902216/+subscriptions -- Mailing list: https://launchpad.net/~yahoo-eng-team Post to : yahoo-eng-team@lists.launchpad.net Unsubscribe : https://launchpad.net/~yahoo-eng-team More help : https://help.launchpad.net/ListHelp
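A hedged sketch of the direction proposed here (the function name and model table are illustrative, not nova's actual data): validate the configured cpu_model against the guest architecture taken from the image, instead of rejecting any model that doesn't match the host CPU.

```python
# Illustrative model table; a real implementation would query
# libvirt/QEMU capabilities for the emulated architecture instead.
KNOWN_CPU_MODELS = {
    'x86_64': {'qemu64', 'Skylake-Server'},
    'aarch64': {'cortex-a57', 'cortex-a72'},
}

def validate_cpu_model(model, guest_arch):
    """Accept a cpu_model if it exists for the *guest* architecture."""
    if model not in KNOWN_CPU_MODELS.get(guest_arch, set()):
        raise ValueError(
            "CPU model %s is not valid for architecture %s"
            % (model, guest_arch))
    return model
```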
[Yahoo-eng-team] [Bug 1902205] [NEW] UEFI loader should consider the guest architecture not the host
Public bug reported: """ It would be great if Nova supports instances with a different architecture than the host. An use case would be run aarch64 guests in a x86_64 compute node. """ In order to use boot an aarch64 guest in a x86_64 host we need to use UEFI. However, Nova always uses the UEFI loader considering the host architecture. The guest architecture should be considered instead. in livbvirt.driver.py: "for lpath in DEFAULT_UEFI_LOADER_PATH[caps.host.cpu.arch]" Environment === Tested using the master branch (29/10/2020) Other = I'm now opening target bugs for this issue. It was first reported has a generic bug in https://bugs.launchpad.net/nova/+bug/1863728 ** Affects: nova Importance: Undecided Assignee: Belmiro Moreira (moreira-belmiro-email-lists) Status: New ** Description changed: """ It would be great if Nova supports instances with a different architecture than the host. An use case would be run aarch64 guests in a x86_64 compute node. """ In order to use boot an aarch64 guest in a x86_64 host we need to use UEFI. However, Nova always uses the UEFI loader considering the host architecture. The guest architecture should be considered instead. in livbvirt.driver.py: "for lpath in DEFAULT_UEFI_LOADER_PATH[caps.host.cpu.arch]" Environment === Tested using the master branch (29/10/2020) Other = I'm now opening target bugs for this issue. - It was first reported has a generic bug in https://bugs.launchpad + It was first reported has a generic bug in https://bugs.launchpad.net/nova/+bug/1863728 ** Changed in: nova Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists) -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). 
https://bugs.launchpad.net/bugs/1902205

Title: UEFI loader should consider the guest architecture not the host
Status in OpenStack Compute (nova): New

To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1902205/+subscriptions
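The fix amounts to indexing DEFAULT_UEFI_LOADER_PATH with the guest architecture instead of caps.host.cpu.arch. A minimal sketch, with illustrative firmware paths (the real table in nova differs):

```python
# Hypothetical loader table keyed by architecture.
DEFAULT_UEFI_LOADER_PATH = {
    'x86_64': ['/usr/share/OVMF/OVMF_CODE.fd'],
    'aarch64': ['/usr/share/AAVMF/AAVMF_CODE.fd'],
}

def uefi_loader_paths(image_arch, host_arch):
    """Prefer the architecture declared on the image; fall back to host."""
    arch = image_arch or host_arch
    return DEFAULT_UEFI_LOADER_PATH.get(arch, [])
```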
[Yahoo-eng-team] [Bug 1902203] [NEW] Instance architecture should be reflected in the instance domain
Public bug reported: """ It would be great if Nova supports instances with a different architecture than the host. An use case would be run aarch64 guests in a x86_64 compute node. """ The issue is that nova always uses the architecture from the host when defining the instance domain and not what's defined in the image architecture. Also, because of this the emulator is not correctly defined. Almost all the pieces are already there! - CONF.libvirt.hw_machine_type / or using the instance metadata (it's defined as expected in the instance domain, I'm using "virt-4.0") - CONF.libvirt.virt_type (it's defined as expected in the instance domain, I'm using "qemu") - Defined the image architecture to "aarch64". Actually Nova reads this property from the image but doesn't use it. === The instance creation fails because: Nova only uses in the domain definition: hvm and then libvirt uses the host architecture in the domain definition. In my case this results in using the x86_64 emulator. When hardcoding the right architecture in the guest.os_mach_type it works as expected. if self.os_mach_type is not None: type_node.set("arch", 'aarch64') type_node.set("machine", self.os_mach_type) The domain is created correctly: hvm /usr/bin/qemu-system-aarch64 Environment === Tested using the master branch (29/10/2020) Other = I'm now opening target bugs for this issue. It was first reported has a generic bug in https://bugs.launchpad.net/nova/+bug/1863728 ** Affects: nova Importance: Undecided Status: New ** Description changed: """ It would be great if Nova supports instances with a different architecture than the host. An use case would be run aarch64 guests in a x86_64 compute node. """ The issue is that nova always uses the architecture from the host when defining the instance domain and not what's defined in the image architecture. Also, because of this the emulator is not correctly defined. Almost all the pieces are already there! 
- - CONF.libvirt.hw_machine_type / or using the instance metadata + - CONF.libvirt.hw_machine_type / or using the instance metadata (it's defined as expected in the instance domain, I'm using "virt-4.0") - - CONF.libvirt.virt_type + - CONF.libvirt.virt_type (it's defined as expected in the instance domain, I'm using "qemu") - Defined the image architecture to "aarch64". Actually Nova reads this property from the image but doesn't use it. === The instance creation fails because: - Nova only uses in the domain definition: hvm + Nova only uses in the domain definition: + hvm + and then libvirt uses the host architecture in the domain definition. In my case this results in using the x86_64 emulator. When hardcoding the right architecture in the guest.os_mach_type it works as expected. if self.os_mach_type is not None: - type_node.set("arch", 'aarch64') - type_node.set("machine", self.os_mach_type) + type_node.set("arch", 'aarch64') + type_node.set("machine", self.os_mach_type) The domain is created correctly: hvm /usr/bin/qemu-system-aarch64 - Environment === Tested using the master branch (29/10/2020) Other = I'm now opening target bugs for this issue. It was first reported has a generic bug in https://bugs.launchpad.net/nova/+bug/1863728 -- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1902203 Title: Instance architecture should be reflected in the instance domain Status in OpenStack Compute (nova): New Bug description: """ It would be great if Nova supports instances with a different architecture than the host. An use case would be run aarch64 guests in a x86_64 compute node. """ The issue is that nova always uses the architecture from the host when defining the instance domain and not what's defined in the image architecture. Also, because of this the emulator is not correctly defined. Almost all the pieces are already there! 
- CONF.libvirt.hw_machine_type / or using the instance metadata (it's defined as expected in the instance domain, I'm using "virt-4.0") - CONF.libvirt.virt_type (it's defined as expected in the instance domain, I'm using "qemu") - Defined the image architecture to "aarch64". Actually Nova reads this property from the image but doesn't use it. === The instance creation fails because: Nova only uses in the domain definition: hvm and then libvirt uses the host architecture in the domain definition. In my case this results in using the x86_64 emulator. When hardcoding the right architecture in the guest.os_mach_type it works as expected. if self.os_mach_type is not None: type_node.set("arch", 'aarch64') type_node.set("machine", self.os_mach_type) The domain is created correctly: hvm
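The hardcoded patch above generalizes naturally: take the arch from the image's architecture property. A sketch using plain ElementTree (nova actually builds the domain through its own config classes; this only shows the resulting type element):

```python
from xml.etree import ElementTree as ET

def build_os_type_node(guest_arch, machine_type):
    """Build the libvirt <os><type> element with the guest arch set.

    ``guest_arch`` would come from the image's architecture property
    (e.g. 'aarch64') rather than the host; ``machine_type`` from
    CONF.libvirt.hw_machine_type (e.g. 'virt-4.0').
    """
    type_node = ET.Element('type')
    if guest_arch:
        type_node.set('arch', guest_arch)
    if machine_type:
        type_node.set('machine', machine_type)
    type_node.text = 'hvm'
    return type_node
```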
[Yahoo-eng-team] [Bug 1863728] [NEW] Nova can't create instances for a different arch
Public bug reported: This is more a feature wish than a bug, but considering the use cases I'm surprised that it's not supported by nova.

*Support creating instances for a different architecture than the host architecture*

My use case: running ARM instances on x86_64 compute nodes. This is not possible because nova always assumes the host architecture. Also, there are different assumptions for the different architectures. Some examples:
- cpu_mode for AARCH64 is passthrough (not good if trying to emulate).
- Nova always checks the cpu_model against the host, so it's not possible to define an ARM cpu.
- the architecture image property is not used when defining the instance domain
(...)

This is mostly for discussion and to see if the community is interested in supporting this use case.

** Affects: nova Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1863728

Title: Nova can't create instances for a different arch
Status in OpenStack Compute (nova): New
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1863728/+subscriptions
[Yahoo-eng-team] [Bug 1848514] [NEW] Booting from volume providing an image fails
Public bug reported: Trying to create an instance (booting from volume when specifying an image) fails. Running Stein (19.0.1).

### When using: ###
nova boot --flavor FLAVOR_ID --block-device source=image,id=IMAGE_ID,dest=volume,size=10,shutdown=preserve,bootindex=0 INSTANCE_NAME

### nova-compute logs: ###
Instance failed block device setup Forbidden: Policy doesn't allow volume:update_volume_admin_metadata to be performed. (HTTP 403) (Request-ID: req-875cc6e1-ffe1-45dd-b942-944166c6040a)

The full trace: http://paste.openstack.org/raw/784535/

Definitely this is a policy issue! Our cinder policy: "volume:update_volume_admin_metadata": "rule:admin_api" (default)

Using a user with admin credentials works as expected! Is this expected? We didn't identify this behaviour previously (before Stein) using the same policy for "update_volume_admin_metadata".

Found an old similar report: https://bugs.launchpad.net/nova/+bug/1661189

** Affects: nova Importance: Undecided Assignee: Surya Seetharaman (tssurya) Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1848514

Title: Booting from volume providing an image fails
Status in OpenStack Compute (nova): New
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1848514/+subscriptions
[Yahoo-eng-team] [Bug 1837200] [NEW] Deleted images info should be obfuscated - OSSN-0075
Public bug reported: Because of OSSN-0075 the cloud operator may choose to never purge the "images" table. But regulations/policy may require that deleted data is not kept. For this case the deleted image records need to be obfuscated (except the image id).

** Affects: glance Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to Glance. https://bugs.launchpad.net/bugs/1837200

Title: Deleted images info should be obfuscated - OSSN-0075
Status in Glance: New

To manage notifications about this bug go to: https://bugs.launchpad.net/glance/+bug/1837200/+subscriptions
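A minimal sketch of the obfuscation being asked for (the field names are illustrative, not the exact glance schema): null out user-supplied fields of a deleted image record while keeping the image id, so the OSSN-0075 protection against image-id reuse still works.

```python
# Fields to scrub on deleted records; illustrative subset only.
SCRUB_FIELDS = ('name', 'owner', 'checksum', 'locations')

def obfuscate_deleted_image(record):
    """Return a copy with user data removed, preserving the id."""
    if not record.get('deleted'):
        return dict(record)
    scrubbed = dict(record)
    for field in SCRUB_FIELDS:
        if field in scrubbed:
            scrubbed[field] = None
    return scrubbed
```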
[Yahoo-eng-team] [Bug 1817542] [NEW] nova instance-action fails if project_id=NULL
Public bug reported: nova instance-action fails if project_id=NULL.

Starting in api version 2.62 "an obfuscated hashed host id is returned". To generate the host_id it uses utils.generate_hostid(), which (in this case) uses the project_id and the host of the action. However, we can have actions without a user_id/project_id defined. For example, when something happens outside the nova API (the user shuts down the VM inside the guest OS). In this case we have a "stop" action without a user_id/project_id.

When running 2.62 it fails when performing: nova instance-action
No issues if using: --os-compute-api-version 2.60

===
The trace in nova-api logs:

```
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py", line 801, in wrapped
    return f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/instance_actions.py", line 169, in show
    ) for evt in events_raw]
  File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/instance_actions.py", line 69, in _format_event
    project_id)
  File "/usr/lib/python2.7/site-packages/nova/utils.py", line 1295, in generate_hostid
    data = (project_id + host).encode('utf-8')
TypeError: unsupported operand type(s) for +: 'NoneType' and 'unicode'
```

** Affects: nova Importance: Undecided Status: New

** Tags: api

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1817542

Title: nova instance-action fails if project_id=NULL
Status in OpenStack Compute (nova): New
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1817542/+subscriptions
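A sketch of a None-safe variant of the helper in the traceback, assuming the same sha224 hashing over project_id + host that the traceback shows; treating a missing project_id as an empty string is the assumption here, and whether that is the right semantics is the open question of this bug:

```python
import hashlib

def generate_hostid(host, project_id):
    """Hash host with project_id, tolerating project_id=None.

    Actions recorded outside the API (e.g. a guest-initiated stop)
    have no project_id; fall back to '' instead of raising TypeError.
    """
    data = ((project_id or '') + host).encode('utf-8')
    return hashlib.sha224(data).hexdigest()
```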
[Yahoo-eng-team] [Bug 1816086] [NEW] Resource Tracker performance with Ironic driver
Public bug reported: The problem is in Rocky.

The resource tracker builds the resource provider tree, and it is updated twice in "_update_available_resource": by "_init_compute_node" and in "_update_available_resource" itself. The problem is that the RP tree contains all the Ironic RPs, and the whole tree is flushed to placement (twice, as described above) while the periodic task iterates over each Ironic RP. In our case, with 1700 Ironic nodes, the periodic task takes: 1700 x (2 x 7s) = ~6h

+++
Mitigations:
- Shard nova-compute. Have several nova-computes dedicated to Ironic. Most of the current deployments only use 1 nova-compute to avoid resource shuffle/recreation between nova-computes. Several nova-computes will be needed to accommodate the load.
- Why do we need to do the full resource provider tree flush to placement and not only the RP that is being considered? As a workaround we are doing this now!

** Affects: nova Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1816086

Title: Resource Tracker performance with Ironic driver
Status in OpenStack Compute (nova): New
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1816086/+subscriptions
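The back-of-the-envelope figure above, spelled out (7 s per placement flush is the observed value from this deployment; two flushes per node per pass):

```python
nodes = 1700
flushes_per_node = 2      # _init_compute_node + _update_available_resource
seconds_per_flush = 7     # observed per-flush cost in this deployment

pass_seconds = nodes * flushes_per_node * seconds_per_flush
pass_hours = pass_seconds / 3600  # roughly 6.6 hours per periodic-task pass
```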
[Yahoo-eng-team] [Bug 1816034] [NEW] Ironic flavor migration and default resource classes
Public bug reported: The Ironic flavor migration to use resource classes happened in Pike/Queens. The flavors and the instances needed to be upgraded with the correct resource class. This was done by an online data migration. Looking into Rocky code: ironic.driver._pike_flavor_migration. There is also an offline data migration using nova-manage.

These migrations added the node resource class into instance_extra.flavor, however I don't see that they also included the default resource classes (VCPU, MEMORY_MB, DISK_GB) set to 0.

Looking into Rocky code there is also a TODO in _pike_flavor_migration: "This code can be removed in Queens, and will need to be updated to also alter extra_specs to zero-out the old-style standard resource classes of VCPU, MEMORY_MB, and DISK_GB."

Currently all my Ironic instances have the correct node resource class defined, but "old" instances (created before the flavor migration) don't have VCPU, MEMORY_MB, DISK_GB set to 0 in instance_extra.flavor.

In Rocky the resource tracker raises the following message: "There was a conflict when trying to complete your request.\n\n Unable to allocate inventory: Inventory for 'VCPU' on resource provider invalid. ", "title": "Conflict" because it tries to update the allocation but the inventory doesn't have vcpu resources.

---
As mitigation we now have: "requires_allocation_refresh = False" in the Ironic Driver.

** Affects: nova Importance: Undecided Status: New

-- You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova). https://bugs.launchpad.net/bugs/1816034

Title: Ironic flavor migration and default resource classes
Status in OpenStack Compute (nova): New
To manage notifications about this bug go to: https://bugs.launchpad.net/nova/+bug/1816034/+subscriptions
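For reference, zeroing the standard resource classes is expressed through flavor extra specs like the following (CUSTOM_BAREMETAL_GOLD is an example resource class name; the actual class depends on the Ironic node's resource_class):

```
resources:VCPU=0
resources:MEMORY_MB=0
resources:DISK_GB=0
resources:CUSTOM_BAREMETAL_GOLD=1
```

The point of this bug is that the data migrations set the custom class on the instances' embedded flavors but, as far as I can see, never added the three zeroed standard classes.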
[Yahoo-eng-team] [Bug 1810342] [NEW] API unexpected exception message
Public bug reported:

The "API unexpected exception" message tells the user to open a bug in
launchpad and attach the log file if possible. Usually a user doesn't
have access to API logs and doesn't know about the nuts and bolts of
OpenStack. This error message has been confusing some of our users
because it asks them not to contact the cloud provider support but
instead a website that they don't know.

I would prefer to have only a simple error message like "API unexpected
exception" or, instead, a configurable message where the cloud provider
can point their users to the correct support page.

** Affects: nova
     Importance: Undecided
     Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
         Status: In Progress

** Changed in: nova
     Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

** Description changed:

- The "API unexpected exception" message tells the user to open a bug in launchpad
- and attach the log file if possible.
- Usually an user doesn't have access to API logs and doesn't know about the nuts
- and bolts of OpenStack.
- This error message has been confusing some of our users because asks them to not
- contact the cloud provider support but instead a website that they don't know.
+ The "API unexpected exception" message tells the user to open a bug in launchpad and attach the log file if possible.
+ Usually an user doesn't have access to API logs and doesn't know about the nuts and bolts of OpenStack.
+ This error message has been confusing some of our users because asks them to not contact the cloud provider support but instead a website that they don't know.
- I would prefer to have only a simple error message like "API unexpected exception"
- or instead, a configurable message where the cloud provider can point their users
- to the correct support page.
+ I would prefer to have only a simple error message like "API unexpected
+ exception" or instead, a configurable message where the cloud provider
+ can point their users to the correct support page.

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1810342

Title:
  API unexpected exception message

Status in OpenStack Compute (nova):
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1810342/+subscriptions
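[Editorial note] A configurable support pointer, as suggested in the
report, could look like the following sketch. The `support_url` option
and the helper are hypothetical illustrations, not Nova's real
configuration schema; Nova would declare the option with oslo.config
rather than read it from a dict:

```python
# Sketch of a configurable "unexpected exception" message. The option
# name (support_url) is hypothetical; a real implementation would
# register it with oslo.config instead of using a plain dict.
DEFAULT_MSG = 'Unexpected API Error.'

def unexpected_error_message(conf):
    support_url = conf.get('support_url')
    if support_url:
        return '%s Please contact your cloud provider support: %s' % (
            DEFAULT_MSG, support_url)
    # Fall back to a simple message with no pointer to Launchpad.
    return DEFAULT_MSG
```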
[Yahoo-eng-team] [Bug 1810340] [NEW] Repetitive info messages from nova-compute
Public bug reported:

There are 2 repetitive info messages from nova-compute:

INFO nova.compute.resource_tracker Final resource view:
INFO nova.virt.libvirt.driver Libvirt baseline CPU

By default they are logged every minute. In my view they should be
"debug" messages. In large infrastructures that store log files for
analytics, these messages use significant storage space without
bringing reasonable value.

** Affects: nova
     Importance: Undecided
     Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
         Status: In Progress

** Changed in: nova
     Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1810340

Title:
  Repetitive info messages from nova-compute

Status in OpenStack Compute (nova):
  In Progress

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1810340/+subscriptions
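[Editorial note] The proposed change amounts to switching the two
`LOG.info(...)` calls to `LOG.debug(...)`; with the usual INFO default
level the demoted messages are then filtered out entirely, as this
standard-library sketch shows (the logger name mirrors the Nova module,
but the setup is illustrative):

```python
import logging
from io import StringIO

# Capture log output in memory so we can inspect what a default
# INFO-level handler would actually write out.
stream = StringIO()
handler = logging.StreamHandler(stream)
LOG = logging.getLogger('nova.compute.resource_tracker')
LOG.addHandler(handler)
LOG.setLevel(logging.INFO)  # typical production default

LOG.info('Final resource view: ...')   # emitted every minute today
LOG.debug('Final resource view: ...')  # after the change: filtered out

output = stream.getvalue()
```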
[Yahoo-eng-team] [Bug 1805989] [NEW] Weight policy to stack/spread instances and "max_placement_results"
Public bug reported:

Weights are applied by the scheduler. This means that if
"max_placement_results" is used with a value below the number of
existing resources, the weight policy will only be applied to the
subset of allocation candidates retrieved by placement. As a
consequence we lose the policy to stack/spread instances.

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1805989

Title:
  Weight policy to stack/spread instances and "max_placement_results"

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1805989/+subscriptions
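[Editorial note] A toy model makes the loss concrete: a "spread"
weigher that prefers the emptiest host picks a worse host when it only
sees a truncated candidate list. The host data below is invented; only
the `max_placement_results` option name comes from the report:

```python
# Toy model: each host has an amount of free RAM; a "spread" weigher
# prefers the host with the most free RAM. Placement, however, only
# returns the first max_placement_results candidates, so the weigher
# never sees the globally best host.
hosts = {'host1': 2048, 'host2': 4096, 'host3': 8192, 'host4': 16384}

def schedule(free_ram_by_host, max_placement_results):
    # Placement returns an (essentially arbitrary) subset of candidates.
    candidates = list(free_ram_by_host)[:max_placement_results]
    # The spread weigher picks the emptiest host, but only in the subset.
    return max(candidates, key=free_ram_by_host.get)

best_global = schedule(hosts, max_placement_results=4)     # sees every host
best_truncated = schedule(hosts, max_placement_results=2)  # subset only
```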
[Yahoo-eng-team] [Bug 1805984] [NEW] Placement is not aware of disabled compute nodes
Public bug reported:

Placement doesn't know if a resource provider (in this particular case
a compute node) is disabled. This is only filtered by the scheduler
using the "ComputeFilter".

However, when using the option "max_placement_results" to restrict the
amount of placement results, there is the possibility of getting only
"disabled" allocation candidates from placement. The creation of new
VMs will then end up in ERROR because there are "No Valid Hosts".

There are several use cases where an operator may want to disable nodes
to avoid the creation of new VMs.

Related with: https://bugs.launchpad.net/nova/+bug/1708958

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1805984

Title:
  Placement is not aware of disabled compute nodes

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1805984/+subscriptions
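[Editorial note] The failure mode can be shown with the same kind of
toy model: if the truncated subset returned by placement happens to
contain only disabled hosts, the ComputeFilter leaves nothing, even
though enabled hosts exist beyond the subset (illustrative data only):

```python
# Toy model of the failure: placement returns the first
# max_placement_results candidates without knowing which services are
# disabled; the ComputeFilter then rejects them all, even though
# enabled hosts exist beyond the truncated subset.
hosts = ['host1', 'host2', 'host3', 'host4']
disabled = {'host1', 'host2'}

def filter_candidates(all_hosts, disabled_hosts, max_placement_results):
    candidates = all_hosts[:max_placement_results]  # placement's subset
    return [h for h in candidates if h not in disabled_hosts]

# With the full list the scheduler finds valid hosts...
ok = filter_candidates(hosts, disabled, max_placement_results=4)
# ...but a truncated subset can be entirely disabled: "No Valid Hosts".
none_left = filter_candidates(hosts, disabled, max_placement_results=2)
```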
[Yahoo-eng-team] [Bug 1801897] [NEW] List AVZs can take several seconds
Public bug reported:

Getting the list of AVZs can take several seconds (~30 secs. in our
case). This is noticeable in Horizon when creating a new instance
because the user can't select an AVZ until this completes.

workflow:
- get all services from all cells (~1 for us)
- fetch all aggregates which are tagged as an AVZ
- construct a dict of {service['host']: avz.value}
- return a dict of {'avz_value': list of hosts}
- separate available and not available zones.

Reproducible in Queens, Rocky

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1801897

Title:
  List AVZs can take several seconds

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1801897/+subscriptions
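[Editorial note] The workflow listed above can be sketched with plain
data structures standing in for Nova's service and aggregate objects;
the fallback of unassigned hosts to a default 'nova' zone is an
assumption of this sketch:

```python
from collections import defaultdict

# Rough sketch of the AZ-listing workflow: map each host to the AZ of
# an aggregate it belongs to, then invert that into {az: [hosts]}.
def hosts_by_az(service_hosts, az_aggregates):
    # az_aggregates: {az_name: set_of_hosts} for aggregates tagged as an AZ
    host_to_az = {}
    for az, agg_hosts in az_aggregates.items():
        for host in agg_hosts:
            host_to_az[host] = az
    result = defaultdict(list)
    for host in service_hosts:
        # Hosts in no AZ-tagged aggregate fall into the default zone.
        result[host_to_az.get(host, 'nova')].append(host)
    return dict(result)

zones = hosts_by_az(['cn1', 'cn2', 'cn3'],
                    {'zone-a': {'cn1'}, 'zone-b': {'cn2'}})
```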
[Yahoo-eng-team] [Bug 1796920] [NEW] Baremetal nodes should not be exposing non-custom-resource-class (vcpu, ram, disk)
Public bug reported:

Description
===========
Baremetal nodes report CPU, RAM and DISK inventory. The issue is that
allocations for baremetal nodes are only done considering the
custom_resource_class. This happens because baremetal flavors are set
to not consume these resources. See:
https://docs.openstack.org/ironic/queens/install/configure-nova-flavors.html

If we use a flavor that doesn't include a custom_resource_class,
placement can include a baremetal node that is already deployed because
cpu, ram and disk are available (but this results in an error from
ironic), or worse, the instance is created on a baremetal node (if it
wasn't deployed yet).

Environment
===========
Nova and Ironic running Queens release.

** Affects: nova
     Importance: Undecided
         Status: Invalid

** Affects: nova/pike
     Importance: High
         Status: Triaged

** Affects: nova/queens
     Importance: High
         Status: Triaged

** Affects: nova/rocky
     Importance: High
     Assignee: Matt Riedemann (mriedem)
         Status: Triaged

** Tags: ironic

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1796920

Title:
  Baremetal nodes should not be exposing non-custom-resource-class
  (vcpu, ram, disk)

Status in OpenStack Compute (nova):
  Invalid
Status in OpenStack Compute (nova) pike series:
  Triaged
Status in OpenStack Compute (nova) queens series:
  Triaged
Status in OpenStack Compute (nova) rocky series:
  Triaged

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1796920/+subscriptions
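[Editorial note] Following the configure-nova-flavors guide linked
above, a correctly configured baremetal flavor requests one unit of a
CUSTOM_* resource class and explicitly zeroes VCPU, MEMORY_MB and
DISK_GB. A small check like this hypothetical sketch (plain dicts for
flavor extra_specs) can flag flavors that would hit the bug:

```python
# Sanity-check sketch for baremetal flavors, per the linked guide:
# a safe flavor requests one unit of a CUSTOM_* resource class and
# zeroes the standard classes so placement ignores the standard
# inventory the node still reports.
def is_safe_baremetal_flavor(extra_specs):
    has_custom = any(k.startswith('resources:CUSTOM_') and v == '1'
                     for k, v in extra_specs.items())
    standard_zeroed = all(
        extra_specs.get('resources:%s' % rc) == '0'
        for rc in ('VCPU', 'MEMORY_MB', 'DISK_GB'))
    return has_custom and standard_zeroed

good = {'resources:CUSTOM_BAREMETAL_GOLD': '1',
        'resources:VCPU': '0', 'resources:MEMORY_MB': '0',
        'resources:DISK_GB': '0'}
bad = {}  # no custom class: schedules against cpu/ram/disk inventory
```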
[Yahoo-eng-team] [Bug 1771810] [NEW] Quota calculation connects to all available cells
Public bug reported:

Quota utilisation calculation connects to all cell DBs to get all
consumed resources for a project. When having several cells this can be
inefficient, and it can fail if one of the cell DBs is not available.

To calculate the quota utilisation of a project it should be enough to
use only the cells where the project has/had instances. This
information is available in the nova_api DB.

** Affects: nova
     Importance: Undecided
     Assignee: Surya Seetharaman (tssurya)
         Status: New

** Tags: cells quotas

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1771810

Title:
  Quota calculation connects to all available cells

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1771810/+subscriptions
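[Editorial note] The targeted lookup suggested above reduces to one
query against the nova_api DB's instance_mappings table, sketched here
on an in-memory SQLite copy of a much-simplified schema:

```python
import sqlite3

# Sketch of the suggested optimisation: instead of connecting to every
# cell DB, first ask instance_mappings (nova_api) which cells ever
# hosted instances for the project, and only query those.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE instance_mappings '
             '(instance_uuid TEXT, project_id TEXT, cell_id INTEGER)')
conn.executemany(
    'INSERT INTO instance_mappings VALUES (?, ?, ?)',
    [('uuid-1', 'project-a', 1),
     ('uuid-2', 'project-a', 1),
     ('uuid-3', 'project-a', 3),
     ('uuid-4', 'project-b', 2)])

def cells_for_project(conn, project_id):
    rows = conn.execute(
        'SELECT DISTINCT cell_id FROM instance_mappings '
        'WHERE project_id = ? ORDER BY cell_id', (project_id,))
    return [r[0] for r in rows]

cells = cells_for_project(conn, 'project-a')  # only these cells need querying
```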
[Yahoo-eng-team] [Bug 1771806] [NEW] Ironic nova-compute failover creates new resource provider removing the resource_provider_aggregates link
Public bug reported:

When using the request_filter functionality, aggregates are mapped to
placement_aggregates. placement_provider_aggregates contains the
resource providers mapped in aggregate_hosts.

The problem happens when a nova-compute for ironic fails and hosts are
automatically moved to a different nova-compute. In this case a new
compute_node entry is created, originating a new resource provider. As
a consequence, placement_provider_aggregates doesn't have the new
resource providers.

** Affects: nova
     Importance: Undecided
     Assignee: Surya Seetharaman (tssurya)
         Status: New

** Tags: ironic placement

** Tags removed: placem
** Tags added: ironic placement

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1771806

Title:
  Ironic nova-compute failover creates new resource provider removing
  the resource_provider_aggregates link

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1771806/+subscriptions
[Yahoo-eng-team] [Bug 1768876] [NEW] Old instances can get AVZ from metadata
Public bug reported:

Can't get the AVZ for old instances:

curl http://169.254.169.254/latest/meta-data/placement/availability-zone
None#

This is because the upcall to the nova_api DB was removed in the
commit: 9f7bac2 and old instances may not have the AVZ defined.
Previously, the AVZ in the instance was only set if explicitly defined
by the user.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1768876

Title:
  Old instances can get AVZ from metadata

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1768876/+subscriptions
[Yahoo-eng-team] [Bug 1767309] [NEW] Placement - Make association_refresh configurable
Public bug reported:

In Queens the provider-tree refresh happens every 5 min (also in
master).

ASSOCIATION_REFRESH = 300

For large deployments this creates unnecessary load in placement. This
option should be configurable.

related with: https://review.openstack.org/#/c/535517/

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1767309

Title:
  Placement - Make association_refresh configurable

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1767309/+subscriptions
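[Editorial note] This interval did later become configurable; from
memory of later releases the option is
`resource_provider_association_refresh` under `[compute]`, so a
deployment wanting a slower refresh would set something like the
following in nova.conf (verify the option name against your release's
configuration reference):

```ini
[compute]
# Seconds between provider-tree/aggregate/trait refreshes against
# placement (historically hard-coded to 300); raising it cuts the
# steady-state load on the placement service.
resource_provider_association_refresh = 3600
```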
[Yahoo-eng-team] [Bug 1767303] [NEW] Scheduler connects to all cells DBs to gather compute nodes info
Public bug reported:

The scheduler host_manager connects to all cell DBs to get compute node
info even if only a subset of compute node uuids is given by placement.
This has a performance impact in large cloud deployments with several
cells.

Also related with:
https://review.openstack.org/#/c/539617/9/nova/scheduler/host_manager.py

{code}
def _get_computes_for_cells(self, context, cells, compute_uuids=None):
    for cell in cells:
        LOG.debug('Getting compute nodes and services for cell %(cell)s',
                  {'cell': cell.identity})
        with context_module.target_cell(context, cell) as cctxt:
            if compute_uuids is None:
                compute_nodes[cell.uuid].extend(
                    objects.ComputeNodeList.get_all(cctxt))
            else:
                compute_nodes[cell.uuid].extend(
                    objects.ComputeNodeList.get_all_by_uuids(
                        cctxt, compute_uuids))
            services.update(
                {service.host: service
                 for service in objects.ServiceList.get_by_binary(
                     cctxt, 'nova-compute', include_disabled=True)})
    return compute_nodes, services
{code}

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1767303

Title:
  Scheduler connects to all cells DBs to gather compute nodes info

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1767303/+subscriptions
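[Editorial note] One way to avoid the unnecessary connections is to
resolve which cells own the requested uuids before entering the loop,
e.g. from host mapping data in the nova_api DB. A sketch with plain
dicts standing in for the real mapping objects:

```python
# Sketch: given a mapping of compute-node uuid -> cell (which the
# nova_api DB can provide via host mappings), only the cells that
# actually own one of the requested uuids need a DB connection.
def cells_to_query(uuid_to_cell, compute_uuids):
    if compute_uuids is None:
        return sorted(set(uuid_to_cell.values()))  # no hint: query all
    return sorted({uuid_to_cell[u] for u in compute_uuids
                   if u in uuid_to_cell})

uuid_to_cell = {'cn-1': 'cell1', 'cn-2': 'cell1', 'cn-3': 'cell2',
                'cn-4': 'cell3'}
targeted = cells_to_query(uuid_to_cell, ['cn-1', 'cn-2'])  # cell1 only
```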
[Yahoo-eng-team] [Bug 1761197] [NEW] Not defined keypairs in instance_extra cellsV1 DBs
Public bug reported:

In Newton there was a data migration to fill the "keypair" in the
instance_extra table. The migration checks if an instance has a keypair
and then adds the keypair entry in the instance_extra table. This works
if the keypair still exists in the keypair table.

However, when running with cellsV1 the keypairs only exist in the top
DB and the migration only works on the instance_extra table of that DB.
This means that in all cell DBs the instance_extra has the keypair not
defined. This is important when migrating to cellsV2 because we will
rely on the cell DBs.

We should have a migration that gets the keypairs from the nova_api DB
to fill the keypair in instance_extra of the different cell DBs.

** Affects: nova
     Importance: Undecided
     Assignee: Surya Seetharaman (tssurya)
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1761197

Title:
  Not defined keypairs in instance_extra cellsV1 DBs

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1761197/+subscriptions
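[Editorial note] The migration suggested above could work roughly as
follows, sketched on heavily simplified SQLite stand-ins for the two
databases. Real Nova stores a serialised KeyPairList blob in
instance_extra.keypairs; a bare public-key string stands in for it
here, and all table layouts are abbreviations of the real schemas:

```python
import sqlite3

# Simplified sketch: pull keypairs from the top-level (api) DB and fill
# the missing entries in a cell DB's instance_extra.
api_db = sqlite3.connect(':memory:')
api_db.execute('CREATE TABLE key_pairs '
               '(user_id TEXT, name TEXT, public_key TEXT)')
api_db.execute("INSERT INTO key_pairs VALUES "
               "('user1', 'mykey', 'ssh-rsa AAAA...')")

cell_db = sqlite3.connect(':memory:')
cell_db.execute('CREATE TABLE instances (uuid TEXT, user_id TEXT, key_name TEXT)')
cell_db.execute('CREATE TABLE instance_extra (instance_uuid TEXT, keypairs TEXT)')
cell_db.execute("INSERT INTO instances VALUES ('uuid-1', 'user1', 'mykey')")
cell_db.execute("INSERT INTO instance_extra VALUES ('uuid-1', NULL)")

def migrate_keypairs(api_db, cell_db):
    fixed = 0
    rows = cell_db.execute(
        'SELECT i.uuid, i.user_id, i.key_name FROM instances i '
        'JOIN instance_extra e ON e.instance_uuid = i.uuid '
        'WHERE e.keypairs IS NULL AND i.key_name IS NOT NULL')
    for uuid, user_id, key_name in rows.fetchall():
        kp = api_db.execute('SELECT public_key FROM key_pairs '
                            'WHERE user_id = ? AND name = ?',
                            (user_id, key_name)).fetchone()
        if kp:
            cell_db.execute('UPDATE instance_extra SET keypairs = ? '
                            'WHERE instance_uuid = ?', (kp[0], uuid))
            fixed += 1
    return fixed

migrated = migrate_keypairs(api_db, cell_db)
```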
[Yahoo-eng-team] [Bug 1761198] [NEW] "Orphan" request_specs and instance_mappings
Public bug reported:

request_specs and instance_mappings in the nova_api DB are not removed
when an instance is deleted. In Queens they are removed when the
instances are archived (https://review.openstack.org/#/c/515034/).

However, deployments that archived instances before running Queens will
have request_specs and instance_mappings that are not associated with
any instance (they were already deleted). We should have a nova-manage
tool to clean these "orphan" records.

** Affects: nova
     Importance: Undecided
     Assignee: Surya Seetharaman (tssurya)
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1761198

Title:
  "Orphan" request_specs and instance_mappings

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1761198/+subscriptions
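[Editorial note] The core of such a cleanup tool is a single anti-join
delete: request_specs rows whose instance_uuid matches no known
instance are orphans. A sketch on a simplified in-memory schema (table
and column layouts abbreviated from the real ones):

```python
import sqlite3

# Sketch of the cleanup query: delete request_specs whose
# instance_uuid no longer matches any instance record.
api_db = sqlite3.connect(':memory:')
api_db.execute('CREATE TABLE request_specs (instance_uuid TEXT)')
api_db.execute('CREATE TABLE instances (uuid TEXT)')  # stand-in for cell data
api_db.executemany('INSERT INTO request_specs VALUES (?)',
                   [('uuid-live',), ('uuid-orphan',)])
api_db.execute("INSERT INTO instances VALUES ('uuid-live')")

def purge_orphan_request_specs(db):
    cur = db.execute(
        'DELETE FROM request_specs '
        'WHERE instance_uuid NOT IN (SELECT uuid FROM instances)')
    return cur.rowcount  # number of orphan rows removed

purged = purge_orphan_request_specs(api_db)
```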
[Yahoo-eng-team] [Bug 1757472] [NEW] Required to define database/connection when running services for nova_api cell
Public bug reported:

Services in the nova_api cell fail to run if database/connection is not
defined. These services should only use api_database/connection.

In devstack, database/connection is defined with the cell0 DB endpoint.
This shouldn't be required because the cell0 is set in the nova_api DB.

** Affects: nova
     Importance: Undecided
     Assignee: Surya Seetharaman (tssurya)
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1757472

Title:
  Required to define database/connection when running services for
  nova_api cell

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1757472/+subscriptions
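[Editorial note] For reference, the two nova.conf sections involved.
The endpoints below are placeholders; the `[database]` entry is the
devstack workaround the report argues should be unnecessary:

```ini
[api_database]
# The only DB endpoint API-level services should strictly need.
connection = mysql+pymysql://nova:***@dbhost/nova_api

[database]
# Workaround described in the report: devstack also points this at
# cell0, even though cell0's endpoint is already recorded in nova_api.
connection = mysql+pymysql://nova:***@dbhost/nova_cell0
```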
[Yahoo-eng-team] [Bug 1735353] [NEW] build_request not deleted when using cellsV1 and local nova_api DB
Public bug reported:

Description
===========
build_request not deleted when using cellsV1 and a local nova_api DB.

Placement needs to be enabled in Newton. CellsV1 installations can
deploy a placement service per child cell in order to have more
efficient scheduling during the transition to cellsV2. This requires a
nova_api DB per cell.

With this configuration the "build_request" that was created in the top
nova_api DB is not deleted after the VM creation, because the deletion
is triggered in "conductor/manager.py", which runs in the child cell
and points to the local nova_api DB. This leaves new VMs in BUILD
state.

Expected result
===============
build_request is removed from the top nova_api DB.

Actual result
=============
nova-cells tries to remove the build_request from the local cell
nova_api DB.

Environment
===========
Nova newton

** Affects: nova
     Importance: Undecided
         Status: New

** Tags: cells

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1735353

Title:
  build_request not deleted when using cellsV1 and local nova_api DB

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1735353/+subscriptions
[Yahoo-eng-team] [Bug 1727266] [NEW] archive_deleted_instances is not atomic for insert/delete
Public bug reported:

Description
===========
Archive deleted instances first moves deleted rows to the shadow tables
and then deletes the rows from the original tables. However, because it
does 2 different selects (to get the rows to insert and to delete) we
can have the case that a row is not inserted into the shadow table but
is removed from the original. This can happen when there are new
deleted rows between the insert and the delete.

Shouldn't we delete explicitly only the IDs that were inserted?

See:

    insert = shadow_table.insert(inline=True).\
        from_select(columns,
                    sql.select([table],
                               deleted_column != deleted_column.default.arg).
                    order_by(column).limit(max_rows))

    query_delete = sql.select([column],
                              deleted_column != deleted_column.default.arg).\
        order_by(column).limit(max_rows)
    delete_statement = DeleteFromSelect(table, query_delete, column)
    (...)
    conn.execute(insert)
    result_delete = conn.execute(delete_statement)

** Affects: nova
     Importance: Undecided
     Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
         Status: New

** Changed in: nova
     Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1727266

Title:
  archive_deleted_instances is not atomic for insert/delete

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1727266/+subscriptions
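[Editorial note] The fix the report suggests ("delete explicitly only
the IDs that were inserted") can be sketched like this: select the row
IDs once, then insert and delete exactly that set, so a row
soft-deleted concurrently can never be dropped without being archived.
SQLite and a toy schema stand in for Nova's SQLAlchemy code:

```python
import sqlite3

# Sketch of the suggested fix: pick the row IDs once, then insert and
# delete exactly those IDs (here inside a single transaction).
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE instances (id INTEGER PRIMARY KEY, deleted INTEGER)')
db.execute('CREATE TABLE shadow_instances '
           '(id INTEGER PRIMARY KEY, deleted INTEGER)')
db.executemany('INSERT INTO instances VALUES (?, ?)',
               [(1, 1), (2, 0), (3, 1)])

def archive_deleted_rows(db, max_rows):
    ids = [r[0] for r in db.execute(
        'SELECT id FROM instances WHERE deleted != 0 ORDER BY id LIMIT ?',
        (max_rows,))]
    if not ids:
        return ids
    marks = ','.join('?' * len(ids))
    with db:  # one transaction covers both statements
        db.execute('INSERT INTO shadow_instances '
                   'SELECT * FROM instances WHERE id IN (%s)' % marks, ids)
        db.execute('DELETE FROM instances WHERE id IN (%s)' % marks, ids)
    return ids

archived = archive_deleted_rows(db, max_rows=10)
```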
[Yahoo-eng-team] [Bug 1726310] [NEW] nova doesn't list services if it can't connect to a cell DB
Public bug reported:

Description
===========
nova doesn't list services if it can't connect to a child cell DB.

I would expect nova to show the services from all child DBs that it can connect to. For the child DBs it can't connect to, it could still list the mandatory services (nova-conductor) with the status "not available" and the reason ("can't connect to the DB") in the disabled reason field.

Steps to reproduce
==================
Have at least 2 child cells. Stop the DB in one of them.

"nova service-list" fails with "ERROR (ClientException): Unexpected API Error." No information is given about what's causing the problem.

Expected result
===============
List the services of the available cells and list the status of the mandatory services of the affected cells as "not available".

Actual result
=============
$ nova service-list fails.

Environment
===========
nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

** Affects: nova
   Importance: Undecided
   Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
   Status: New

** Tags: cells

** Changed in: nova
     Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1726310

Title:
  nova doesn't list services if it can't connect to a cell DB

Status in OpenStack Compute (nova):
  New

Bug description:
  nova doesn't list services if it can't connect to a child cell DB. I would expect nova to show the services from all child DBs that it can connect to. For the child DBs it can't connect to, it could still list the mandatory services (nova-conductor) with the status "not available" and the reason ("can't connect to the DB") in the disabled reason field.

  Steps to reproduce: have at least 2 child cells and stop the DB in one of them. "nova service-list" fails with "ERROR (ClientException): Unexpected API Error." No information is given about what's causing the problem.

  Expected result: list the services of the available cells and list the status of the mandatory services of the affected cells as "not available".

  Actual result: $ nova service-list fails.

  Environment: nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1726310/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
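The degraded listing asked for above can be sketched in plain Python. This is an illustrative sketch only, not nova's actual API: `list_services`, `fetch`, and the row fields are all made up for the example, with `fetch_services` standing in for the per-cell DB query.

```python
# Illustrative sketch: aggregate service listings across cells, degrading
# gracefully when a cell DB is unreachable instead of failing the call.
def list_services(cells, fetch_services):
    """Return service rows for every cell, marking unreachable cells."""
    rows = []
    for cell in cells:
        try:
            rows.extend(fetch_services(cell))
        except ConnectionError:
            # Report the mandatory service as down rather than erroring out.
            rows.append({"cell": cell, "binary": "nova-conductor",
                         "state": "not available",
                         "disabled_reason": "can't connect to the DB"})
    return rows

def fetch(cell):
    # Fake per-cell query: pretend cell2's DB is stopped.
    if cell == "cell2":
        raise ConnectionError("cell2 DB unreachable")
    return [{"cell": cell, "binary": "nova-compute", "state": "up"}]

rows = list_services(["cell1", "cell2"], fetch)
print([r["state"] for r in rows])   # ['up', 'not available']
```

With this shape, one "nova service-list" call always returns a row per cell, and the operator can see which cell DB is the problem.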
[Yahoo-eng-team] [Bug 1726301] [NEW] Nova should list instances even if it can't connect to a cell DB
Public bug reported:

Description
===========
One of the goals of cells is to allow nova to scale and to have cells as failure domains. However, if a cell DB goes down nova doesn't list any instance, even if the project doesn't have any instance in the affected cell. This affects all users.

The behavior that I would expect is for nova to show what's available from the nova_api DB if a cell DB is not available. (UUIDs, and can we look into the request_spec?)

Steps to reproduce
==================
Have at least 2 child cells. Stop the DB in one of them.

"nova list" fails with "ERROR (ClientException): Unexpected API Error." No more information is given to the user.

Expected result
===============
List the project instances. For the instances in the affected cell, list the available information in the nova_api.

Actual result
=============
$ nova list fails without showing the project instances.

Environment
===========
nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

** Affects: nova
   Importance: Undecided
   Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
   Status: New

** Tags: cells

** Description changed:

  Description === One of the goals of cells is to allow nova scale and to have cells as failure domains. However, if a cell DB goes down nova doesn't list any instance. Even if the project doesn't have any instance in the affected cell. This affects all users. The behavior that I would expect is nova to show what's available from the nova_api DB if a cell DB is not available. (UUIDs and can we look into the request_spec?)

- Steps to reproduce == Have at least 2 child cells.
- Stop the DB of one of them.
+ Stop the DB in one of them.
  "nova list" fails with "ERROR (ClientException): Unexpected API Error." Not given any more information to the user.

- Expected result === List the project instances.
- For the instances in the affect cell list the available information in the nova_api.
+ For the instances in the affect cell, list the available information in the nova_api.

  Actual result =
- $nova list
- fails without showing the project instance.
+ $nova list
+ fails without showing the project instances.

  Environment === nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

** Changed in: nova
     Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1726301

Title:
  Nova should list instances even if it can't connect to a cell DB

Status in OpenStack Compute (nova):
  New

Bug description:
  One of the goals of cells is to allow nova to scale and to have cells as failure domains. However, if a cell DB goes down nova doesn't list any instance, even if the project doesn't have any instance in the affected cell. This affects all users. The behavior that I would expect is for nova to show what's available from the nova_api DB if a cell DB is not available. (UUIDs, and can we look into the request_spec?)

  Steps to reproduce: have at least 2 child cells and stop the DB in one of them. "nova list" fails with "ERROR (ClientException): Unexpected API Error." No more information is given to the user.

  Expected result: list the project instances. For the instances in the affected cell, list the available information in the nova_api.

  Actual result: $ nova list fails without showing the project instances.

  Environment: nova master (commit: 8d21d711000fff80eb367692b157d09b6532923f)

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1726301/+subscriptions
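The fallback behaviour requested above (serve what the nova_api DB knows when a cell DB is down) can be sketched as follows. The function names and record fields are illustrative, not nova's real objects; `api_db_uuids` stands in for the instance mappings kept in the API database.

```python
# Illustrative sketch: list instances per cell, falling back to minimal
# records built from API-DB instance mappings for unreachable cells.
def list_instances(cells, fetch_from_cell, api_db_uuids):
    results = []
    for cell in cells:
        try:
            results.extend(fetch_from_cell(cell))
        except ConnectionError:
            # Degrade: the API DB still knows the instance UUIDs.
            results.extend({"uuid": u, "status": "UNKNOWN", "cell": cell}
                           for u in api_db_uuids[cell])
    return results

def fetch(cell):
    # Fake per-cell query: pretend cell2's DB is stopped.
    if cell == "cell2":
        raise ConnectionError("cell2 DB unreachable")
    return [{"uuid": "aaa", "status": "ACTIVE", "cell": cell}]

out = list_instances(["cell1", "cell2"], fetch,
                     {"cell1": ["aaa"], "cell2": ["bbb"]})
print([i["status"] for i in out])   # ['ACTIVE', 'UNKNOWN']
```

The user still sees every instance in the project; only the ones in the broken cell come back with reduced detail instead of the whole listing failing.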
[Yahoo-eng-team] [Bug 1681431] Re: "nova-manage db sync" fails from Mitaka to Newton because deleted compute nodes
*** This bug is a duplicate of bug 1665719 ***
    https://bugs.launchpad.net/bugs/1665719

Already fixed in #1665719.

** Changed in: nova
       Status: New => Invalid

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1681431

Title:
  "nova-manage db sync" fails from Mitaka to Newton because deleted compute nodes

Status in OpenStack Compute (nova):
  Invalid

Bug description:
  "nova-manage db sync" fails from Mitaka to Newton because of deleted compute nodes. The DB migration fails in migration 330 with: "error: There are still XX unmigrated records in the compute_nodes table. Migration cannot continue until all records have been migrated." This migration checks whether there are compute_nodes without a UUID. However, "nova-manage db online_data_migrations" in Mitaka only migrates non-deleted compute_node entries.

  Steps to reproduce:
  1) Have a nova Mitaka DB (319)
  2) Make sure you have a deleted entry (deleted>0) in the "compute_nodes" table.
  3) Make sure all data migrations are done in Mitaka. ("nova-manage db online_data_migrations")
  4) Sync the DB for Newton. ("nova-manage db sync" on a Newton node)

  Expected result: DB migrations succeed (334)

  Actual result: DB doesn't migrate (329)

  Environment: tested with "13.1.2" and "14.0.3".

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1681431/+subscriptions
[Yahoo-eng-team] [Bug 1681431] [NEW] "nova-manage db sync" fails from Mitaka to Newton because deleted compute nodes
Public bug reported:

Description
===========
"nova-manage db sync" fails from Mitaka to Newton because of deleted compute nodes.

The DB migration from Mitaka to Newton fails in migration 330 with: "error: There are still XX unmigrated records in the compute_nodes table. Migration cannot continue until all records have been migrated." This migration checks whether there are compute_nodes without a UUID. However, "nova-manage db online_data_migrations" in Mitaka only migrates non-deleted compute_node entries.

Steps to reproduce
==================
1) Have a nova Mitaka DB (319)
2) Make sure you have a deleted entry (deleted>0) in the "compute_nodes" table.
3) Make sure all data migrations are done in Mitaka. ("nova-manage db online_data_migrations")
4) Sync the DB for Newton. ("nova-manage db sync" on a Newton node)

Expected result
===============
DB migrations succeed (334)

Actual result
=============
DB doesn't migrate (329)

Environment
===========
Tested with "13.1.2" and "14.0.3".

** Affects: nova
   Importance: Undecided
   Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
   Status: New

** Changed in: nova
     Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1681431

Title:
  "nova-manage db sync" fails from Mitaka to Newton because deleted compute nodes

Status in OpenStack Compute (nova):
  New

Bug description:
  "nova-manage db sync" fails from Mitaka to Newton because of deleted compute nodes. The DB migration fails in migration 330 with: "error: There are still XX unmigrated records in the compute_nodes table. Migration cannot continue until all records have been migrated." This migration checks whether there are compute_nodes without a UUID. However, "nova-manage db online_data_migrations" in Mitaka only migrates non-deleted compute_node entries.

  Steps to reproduce:
  1) Have a nova Mitaka DB (319)
  2) Make sure you have a deleted entry (deleted>0) in the "compute_nodes" table.
  3) Make sure all data migrations are done in Mitaka. ("nova-manage db online_data_migrations")
  4) Sync the DB for Newton. ("nova-manage db sync" on a Newton node)

  Expected result: DB migrations succeed (334)

  Actual result: DB doesn't migrate (329)

  Environment: tested with "13.1.2" and "14.0.3".

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1681431/+subscriptions
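The mismatch between the two migrations can be reproduced in miniature with sqlite. The schema below is a deliberately simplified stand-in for nova's compute_nodes table, not the real one:

```python
import sqlite3
import uuid

# Simplified model of the bug: the Mitaka online migration backfills UUIDs
# only for non-deleted rows, while the Newton blocker migration counts
# *every* row that still lacks a UUID, including soft-deleted ones.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE compute_nodes (id INTEGER, uuid TEXT, deleted INTEGER)")
conn.executemany("INSERT INTO compute_nodes VALUES (?, NULL, ?)",
                 [(1, 0),    # live compute node
                  (2, 2)])   # soft-deleted compute node (deleted > 0)

# Mitaka-style "online_data_migrations": skips deleted rows.
conn.execute("UPDATE compute_nodes SET uuid = ? WHERE deleted = 0",
             (str(uuid.uuid4()),))

# Newton-style migration-330 check: counts all rows without a UUID.
unmigrated = conn.execute(
    "SELECT COUNT(*) FROM compute_nodes WHERE uuid IS NULL").fetchone()[0]
print(unmigrated)   # 1 -> "db sync" aborts: "still 1 unmigrated records"
```

The soft-deleted row never gets a UUID, so the blocker check always finds at least one "unmigrated" record and the sync refuses to continue.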
[Yahoo-eng-team] [Bug 1533380] [NEW] Creating multiple instances with a single request when using cells creates wrong instance names
Public bug reported:

When creating multiple instances with a single request, the instance name has the format defined in the "multi_instance_display_name_template" option. By default:

multi_instance_display_name_template=%(name)s-%(count)d

When booting two instances (num-instances=2) with name=test, the expected instance names are:
test-1
test-2

However, when using cells (considering only 2 levels) we get the following names:
test-1-1
test-1-2

Increasing the number of cell levels adds yet another suffix to the instance name. Changing the "multi_instance_display_name_template" to uuids has the same problem. For example: (consider a random uuid)
test--
test--

** Affects: nova
   Importance: Undecided
   Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
   Status: New

** Tags: cells

** Changed in: nova
     Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

** Description changed:

- When creating multiple instances with a single request the instance name has the
- format defined in the "multi_instance_display_name_template" option.
+ When creating multiple instances with a single request the instance name has the format defined in the "multi_instance_display_name_template" option.

  By default: multi_instance_display_name_template=%(name)s-%(count)d

- When booting two instances (num-instances=2) with the name=test is expected to have
- the following instance names:
+ When booting two instances (num-instances=2) with the name=test is expected to have the following instance names:

  test-1
  test-2

  However, if using cells (only considering 2 levels) we have the following names:
  test-1-1
  test-1-2

  Increasing the number of cell levels adds more hops in the instance name. Changing the "multi_instance_display_name_template" to uuids has the same problem. For example: (consider a random uuid)
  test--
  test--

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1533380

Title:
  Creating multiple instances with a single request when using cells creates wrong instance names

Status in OpenStack Compute (nova):
  New

Bug description:
  When creating multiple instances with a single request, the instance name has the format defined in the "multi_instance_display_name_template" option. By default: multi_instance_display_name_template=%(name)s-%(count)d

  When booting two instances (num-instances=2) with name=test, the expected instance names are test-1 and test-2. However, when using cells (considering only 2 levels) we get test-1-1 and test-1-2. Increasing the number of cell levels adds yet another suffix to the instance name. Changing the "multi_instance_display_name_template" to uuids has the same problem.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1533380/+subscriptions
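One plausible reading of the report is that the display-name template is applied once at the top level and then again inside the child cell. A toy reproduction of that double application:

```python
# Illustrative reproduction of the double-suffix naming: the default
# multi_instance_display_name_template applied at two cell levels.
template = "%(name)s-%(count)d"

def apply_template(name, count):
    return template % {"name": name, "count": count}

# The top level stamps the request name once...
top_name = apply_template("test", 1)                      # "test-1"
# ...and the child cell applies the template again per instance.
child_names = [apply_template(top_name, i) for i in (1, 2)]
print(child_names)   # ['test-1-1', 'test-1-2'], not ['test-1', 'test-2']
```

Each extra cell level would add one more `-%(count)d` suffix in the same way.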
[Yahoo-eng-team] [Bug 1532562] [NEW] Cell capacities updates include available resources of compute nodes "down"
Public bug reported:

If a child cell has compute nodes that are enabled but without a heartbeat update (XXX state in "nova-manage service list"), the child cell continues to consider the available resources of these compute nodes when updating the cell capacity.

This can be problematic when having several cells and trying to fill them completely. Requests are sent to the cell that can fit more instances of the requested type; however, when compute nodes are "down" the requests will fail with "No valid host" in the cell.

When updating the cell capacity the "disabled" compute nodes are excluded. The same should happen if a compute node didn't have a heartbeat update during "CONF.service_down_time".

How to reproduce:
1) Have a cell environment with 2 child cells (A and B).
2) Have nova-cells running in "debug".
3) Confirm that the "Received capacities from child cell" A and B (in the top nova-cell log) match the available resources.
4) Stop some compute nodes in cell A.
5) Confirm that the "Received capacities from child cell A" don't change.
6) The cell scheduler can send requests to cell A that fail with "No valid host".

** Affects: nova
   Importance: Undecided
   Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
   Status: New

** Tags: cells

** Changed in: nova
     Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1532562

Title:
  Cell capacities updates include available resources of compute nodes "down"

Status in OpenStack Compute (nova):
  New

Bug description:
  If a child cell has compute nodes that are enabled but without a heartbeat update (XXX state in "nova-manage service list"), the child cell continues to consider the available resources of these compute nodes when updating the cell capacity. This can be problematic when having several cells and trying to fill them completely. Requests are sent to the cell that can fit more instances of the requested type; however, when compute nodes are "down" the requests will fail with "No valid host" in the cell.

  When updating the cell capacity the "disabled" compute nodes are excluded. The same should happen if a compute node didn't have a heartbeat update during "CONF.service_down_time".

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1532562/+subscriptions
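The suggested fix, treating nodes past service_down_time the same as disabled ones when summing capacity, can be sketched like this. The node records and the constant are illustrative, not nova's real data model:

```python
import datetime

# Illustrative sketch: exclude compute nodes whose last heartbeat is older
# than service_down_time when computing cell capacity.
SERVICE_DOWN_TIME = 60  # seconds, mirroring CONF.service_down_time

def usable_nodes(nodes, now):
    def is_up(node):
        age = (now - node["last_seen"]).total_seconds()
        return age <= SERVICE_DOWN_TIME
    # Disabled nodes are already excluded today; "down" nodes should be too.
    return [n for n in nodes if not n["disabled"] and is_up(n)]

now = datetime.datetime(2016, 1, 11, 12, 0, 0)
nodes = [
    {"host": "a", "disabled": False,
     "last_seen": now - datetime.timedelta(seconds=10)},
    {"host": "b", "disabled": False,   # no heartbeat for 10 minutes: "down"
     "last_seen": now - datetime.timedelta(seconds=600)},
]
print([n["host"] for n in usable_nodes(nodes, now)])   # ['a']
```

Summing capacity only over `usable_nodes` would stop the cell scheduler from advertising room on hosts that cannot actually build anything.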
[Yahoo-eng-team] [Bug 1524114] [NEW] nova-scheduler also loads deleted instances at startup
Public bug reported:

nova-scheduler loads all instances (including deleted ones) at startup.

We experienced problems when each compute node has >6000 deleted instances, even when querying in batches of 10 nodes. Each query can take several minutes and transfer several GB of data. This prevented nova-scheduler from connecting to rabbitmq.

###

When nova-scheduler starts it calls "_async_init_instance_info()", which does an "InstanceList.get_by_filters" in batches of 10 nodes. This uses "instance_get_all_by_filters_sort"; however, "Deleted instances will be returned by default, unless there's a filter that says otherwise".

Adding the filter {"deleted": False} fixes the problem.

** Affects: nova
   Importance: Undecided
   Status: New

** Tags: scheduler

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1524114

Title:
  nova-scheduler also loads deleted instances at startup

Status in OpenStack Compute (nova):
  New

Bug description:
  nova-scheduler loads all instances (including deleted ones) at startup. We experienced problems when each compute node has >6000 deleted instances, even when querying in batches of 10 nodes. Each query can take several minutes and transfer several GB of data. This prevented nova-scheduler from connecting to rabbitmq.

  When nova-scheduler starts it calls "_async_init_instance_info()", which does an "InstanceList.get_by_filters" in batches of 10 nodes. This uses "instance_get_all_by_filters_sort"; however, "Deleted instances will be returned by default, unless there's a filter that says otherwise". Adding the filter {"deleted": False} fixes the problem.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1524114/+subscriptions
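The fix boils down to one extra filter in the startup query. A toy model of the filter semantics quoted above ("deleted instances are returned unless a filter says otherwise"); the real call is `InstanceList.get_by_filters` backed by `instance_get_all_by_filters_sort`, and this function only mimics its filter handling:

```python
# Toy model of the DB-layer filter handling described in the report.
def get_by_filters(instances, filters):
    rows = [i for i in instances if i["host"] == filters.get("host", i["host"])]
    if "deleted" in filters:   # only filter on "deleted" when explicitly asked
        rows = [i for i in rows if i["deleted"] == filters["deleted"]]
    return rows

db = [{"host": "n1", "deleted": False},
      {"host": "n1", "deleted": True}]
print(len(get_by_filters(db, {"host": "n1"})))                    # 2
print(len(get_by_filters(db, {"host": "n1", "deleted": False})))  # 1
```

The scheduler's startup query behaves like the first call; adding {"deleted": False} gives the second, which is all the proposed fix changes.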
[Yahoo-eng-team] [Bug 1517006] [NEW] Can't create instances with flavors that have extra specs in a cell setup
Public bug reported:

In a cell setup, instances can't be created with flavors that have extra specs like:
hw:numa_nodes
hw:mem_page_size

nova-cell in the "child cell" fails with:

2015-11-17 10:51:50.574 ERROR nova.cells.scheduler [req-f7dc64e6-a545-4c2c-bc57-4e4a2e86cf58 demo demo] Couldn't communicate with cell 'cell'
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler Traceback (most recent call last):
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File "/opt/stack/nova/nova/cells/scheduler.py", line 186, in _build_instances
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler     image, security_groups, block_device_mapping)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File "/opt/stack/nova/nova/cells/scheduler.py", line 109, in _create_instances_here
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler     instance.update(instance_values)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 727, in update
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler     setattr(self, key, value)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 71, in setter
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler     field_value = field.coerce(self, name, value)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/fields.py", line 189, in coerce
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler     return self._type.coerce(obj, attr, value)
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler   File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/fields.py", line 506, in coerce
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler     'valtype': obj_name})
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler ValueError: An object of type InstanceNUMATopology is required in field numa_topology, not a
2015-11-17 10:51:50.574 TRACE nova.cells.scheduler
2015-11-17 10:51:50.574 ERROR nova.cells.scheduler [req-f7dc64e6-a545-4c2c-bc57-4e4a2e86cf58 demo demo] Couldn't communicate with any cells

Reproduce steps:
1) Set up nova to use cells.
2) Create a flavor with the extra spec "hw:numa_nodes":
   nova flavor-create m1.nano.numa2 30 64 1 1
   nova flavor-key 30 set hw:numa_nodes=1
3) Create an instance with the new flavor.

Actual result:
Instance status: ERROR
Instance task state: scheduling
Trace in the "child cell".

Tested in devstack (master). Tested in Kilo.

** Affects: nova
   Importance: Undecided
   Status: New

** Tags: cells

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1517006

Title:
  Can't create instances with flavors that have extra specs in a cell setup

Status in OpenStack Compute (nova):
  New

Bug description:
  In a cell setup, instances can't be created with flavors that have extra specs like hw:numa_nodes or hw:mem_page_size. nova-cell in the "child cell" fails with the traceback above, ending in the ValueError raised by oslo_versionedobjects fields.py: "An object of type InstanceNUMATopology is required in field numa_topology".
[Yahoo-eng-team] [Bug 1461777] [NEW] NUMA cell overcommit can leave NUMA cells unused
Public bug reported:

NUMA cell overcommit can leave NUMA cells unused.

When no NUMA configuration is defined for the guest (no flavor extra specs), nova identifies the NUMA topology of the host and tries to match the cpu placement to a NUMA cell (cpuset). The cpuset is selected randomly:

pin_cpuset = random.choice(viable_cells_cpus)  # nova/virt/libvirt/driver.py

However, this can leave NUMA cells unused. This is particularly noticeable when the flavor has the same number of vcpus as the host NUMA cells and the host CPUs are not overcommitted (cpu_allocation_ratio = 1).

###

Particular use case:

Compute nodes with the NUMA topology:
VirtNUMAHostTopology: {'cells': [{'mem': {'total': 12279, 'used': 0}, 'cpu_usage': 0, 'cpus': '0,1,2,3,8,9,10,11', 'id': 0}, {'mem': {'total': 12288, 'used': 0}, 'cpu_usage': 0, 'cpus': '4,5,6,7,12,13,14,15', 'id': 1}]}

No CPU overcommit: cpu_allocation_ratio = 1

Boot instances using a flavor with 8 vcpus. (No NUMA topology defined for the guest in the flavor.) In this particular case the host can have 2 instances (no cpu overcommit). Both instances can be allocated (randomly) the same cpuset from the 2 options:

<vcpu placement='static' cpuset='4-7,12-15'>8</vcpu>
<vcpu placement='static' cpuset='0-3,8-11'>8</vcpu>

As a consequence half of the host CPUs are not used.

###

How to reproduce:

Using: nova 2014.2.2 (not tested in trunk, however the code path looks similar)
1. Set cpu_allocation_ratio = 1
2. Identify the NUMA topology of the compute node
3. Using a flavor with a number of vcpus that matches a NUMA cell in the compute node, boot instances until the compute node is full.
4. Check the cpu placement cpuset used by each instance.

Notes:
- at this point instances can use the same cpuset, leaving NUMA cells unused.
- the selection of the cpuset is random. Different tries may be needed.

** Affects: nova
   Importance: Undecided
   Status: New

** Tags: libvirt

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1461777

Title:
  NUMA cell overcommit can leave NUMA cells unused

Status in OpenStack Compute (Nova):
  New

Bug description:
  When no NUMA configuration is defined for the guest (no flavor extra specs), nova identifies the NUMA topology of the host and tries to match the cpu placement to a NUMA cell (cpuset). The cpuset is selected randomly (pin_cpuset = random.choice(viable_cells_cpus) in nova/virt/libvirt/driver.py). However, this can leave NUMA cells unused: both guests on a two-cell host can be allocated the same cpuset, as in the use case above, leaving half of the host CPUs unused. This is particularly noticeable when the flavor has the same number of vcpus as the host NUMA cells and the host CPUs are not overcommitted (cpu_allocation_ratio = 1).

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1461777/+subscriptions
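The random.choice pick quoted above can be contrasted with a least-loaded pick. The usage table below is illustrative, using the two cpusets from the topology in the report:

```python
import random

# Usage count per NUMA cell cpuset (from the host topology in the report).
cells = {"0-3,8-11": 0, "4-7,12-15": 0}

def pick_random():
    # Current behaviour: may return the same cpuset twice in a row.
    return random.choice(list(cells))

def pick_least_used():
    # Alternative: always place the next guest on the emptiest NUMA cell.
    return min(cells, key=cells.get)

print(pick_random() in cells)      # True, but repeated picks can collide

for _ in range(2):                 # place two 8-vcpu guests
    cells[pick_least_used()] += 1
print(sorted(cells.values()))      # [1, 1]: both NUMA cells get used
```

With the random pick there is a 50% chance the two guests land on the same cpuset; tracking per-cell usage removes that chance entirely.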
[Yahoo-eng-team] [Bug 1454418] [NEW] Evacuate fails when using cells - AttributeError: 'NoneType' object has no attribute 'count'
Public bug reported:

nova version: 2014.2.2
Using cells (parent - child setup)

How to reproduce:
nova evacuate instance_uuid target_host

ERROR: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-af20-182a-4acd-869a-1b23314b21d4)

LOG:
2015-05-12 23:17:27.274 8013 ERROR nova.api.openstack [req-af20-182a-4acd-869a-1b23314b21d4 None] Caught error: 'NoneType' object has no attribute 'count'
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
    incoming.message))
  File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/usr/lib/python2.7/site-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
    result = getattr(endpoint, method)(ctxt, **new_args)
  File "/usr/lib/python2.7/site-packages/nova/cells/manager.py", line 268, in service_get_by_compute_host
    service = response.value_or_raise()
  File "/usr/lib/python2.7/site-packages/nova/cells/messaging.py", line 406, in process
    next_hop = self._get_next_hop()
  File "/usr/lib/python2.7/site-packages/nova/cells/messaging.py", line 361, in _get_next_hop
    dest_hops = target_cell.count(_PATH_CELL_SEP)
AttributeError: 'NoneType' object has no attribute 'count'

** Affects: nova
   Importance: Undecided
   Status: New

** Tags: cells

--
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1454418

Title:
  Evacuate fails when using cells - AttributeError: 'NoneType' object has no attribute 'count'

Status in OpenStack Compute (Nova):
  New

Bug description:
  nova version: 2014.2.2
  Using cells (parent - child setup)

  How to reproduce: nova evacuate instance_uuid target_host

  ERROR: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-af20-182a-4acd-869a-1b23314b21d4)

  The API log shows the traceback above, ending in: AttributeError: 'NoneType' object has no attribute 'count'

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1454418/+subscriptions
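The crash is a plain unguarded None: _get_next_hop() calls target_cell.count(_PATH_CELL_SEP), and the evacuate path reaches it with target_cell=None. A minimal illustration; the "!" separator is assumed here for the example, and the guard is illustrative rather than the upstream fix:

```python
# Cell paths are written like "parent!child"; counting separators gives
# the number of hops to the destination cell.
_PATH_CELL_SEP = "!"

def get_next_hop_depth(target_cell):
    if target_cell is None:
        # Guard instead of crashing with AttributeError on None.count(...)
        raise ValueError("message has no target cell; cannot route")
    return target_cell.count(_PATH_CELL_SEP)

print(get_next_hop_depth("parent!child"))   # 1
```

With a guard like this, the evacuate request would fail with a clear routing error instead of the opaque HTTP 500 shown above.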
[Yahoo-eng-team] [Bug 1448564] [NEW] Rescue using cells fails with: unexpected keyword argument 'expected_task_state'
Public bug reported:

Instance rescue gets stuck when using cells.

nova version: 2014.2.2
Using cells (parent-child setup)

How to reproduce:
nova rescue instance_uuid
- the instance task state stays in "rescuing".
- the nova-cells log of the child shows:

2015-04-26 01:26:09.475 20672 ERROR nova.cells.messaging [req-162b3318-70c3-4290-8e09-ffb9fbcef19d None] Error processing message locally: save() got an unexpected keyword argument 'expected_task_state'
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging Traceback (most recent call last):
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File "/usr/lib/python2.7/site-packages/nova/cells/messaging.py", line 199, in _process_locally
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging     resp_value = self.msg_runner._process_message_locally(self)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File "/usr/lib/python2.7/site-packages/nova/cells/messaging.py", line 1293, in _process_message_locally
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging     return fn(message, **message.method_kwargs)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File "/usr/lib/python2.7/site-packages/nova/cells/messaging.py", line 698, in run_compute_api_method
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging     return fn(message.ctxt, *args, **method_info['method_kwargs'])
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 224, in wrapped
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging     return func(self, context, target, *args, **kwargs)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 214, in inner
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging     return function(self, context, instance, *args, **kwargs)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 195, in inner
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging     return f(self, context, instance, *args, **kw)
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging   File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 2750, in rescue
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging     instance.save(expected_task_state=[None])
2015-04-26 01:26:09.475 20672 TRACE nova.cells.messaging TypeError: save() got an unexpected keyword argument 'expected_task_state'

** Affects: nova
   Importance: Undecided
       Status: New

** Tags: cells

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1448564

Title:
  Rescue using cells fails with: unexpected keyword argument 'expected_task_state'

Status in OpenStack Compute (Nova):
  New

Bug description:
  Instance rescue gets stuck when using cells.
  nova version: 2014.2.2
  Using cells (parent-child setup)
  How to reproduce:
  nova rescue instance_uuid
  - the instance task state stays in "rescuing".
[Yahoo-eng-team] [Bug 1417027] [NEW] No disable reason defined for new services when enable_new_services=False
Public bug reported:

When a service is added and enable_new_services=False, no disable reason
is specified. Services can be disabled for several reasons, and admins
can use the API to specify a reason. However, having services disabled
with no reason specified creates additional checks on the operators'
side, which increase with the deployment size.

** Affects: nova
   Importance: Undecided
       Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1417027

Title:
  No disable reason defined for new services when enable_new_services=False

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1417027/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
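[Editor's note] A minimal, self-contained sketch of the behaviour this report asks for: when enable_new_services=False, a newly registered service carries an explicit disabled reason instead of None. The Service class, register_service function, and the "AUTO:" prefix are illustrative assumptions, not nova's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Service:
    """Stand-in for a row in nova's services table."""
    host: str
    binary: str
    disabled: bool = False
    disabled_reason: Optional[str] = None


def register_service(host: str, binary: str, enable_new_services: bool) -> Service:
    """Create a service record, attaching a default reason when auto-disabled."""
    if enable_new_services:
        return Service(host, binary)
    # Record an explicit reason so operators can tell auto-disabled services
    # apart from services disabled manually (maintenance, hardware issues, ...).
    return Service(host, binary, disabled=True,
                   disabled_reason="AUTO: disabled because enable_new_services=False")
```

With a default reason like this, the extra "why is this service disabled?" checks the reporter mentions become unnecessary.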
[Yahoo-eng-team] [Bug 1414480] [NEW] Cell type in “nova-manage cell create” is different from what is used in nova.conf
Public bug reported:

The cell_type option is defined in nova.conf as “api” or “compute”.
However, when creating a cell using “nova-manage”, the cell type
“parent” or “child” is expected. The nova-manage cell_type should be
consistent with what is allowed in nova.conf.

** Affects: nova
   Importance: Undecided
       Status: New

** Tags: cells

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1414480

Title:
  Cell type in “nova-manage cell create” is different from what is used in nova.conf

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1414480/+subscriptions
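[Editor's note] The mismatch described above can be sketched as a small translation layer: nova.conf speaks of "api"/"compute" cells while "nova-manage cell create" expects "parent"/"child". A consistent CLI could accept both vocabularies and normalise them. The function below is an illustrative sketch, not nova-manage code; the mapping mirrors how the two vocabularies line up.

```python
# nova.conf spelling -> nova-manage spelling
CONF_TO_MANAGE = {"api": "parent", "compute": "child"}


def normalize_cell_type(value: str) -> str:
    """Return the nova-manage spelling for either vocabulary."""
    if value in CONF_TO_MANAGE:            # nova.conf spelling ("api"/"compute")
        return CONF_TO_MANAGE[value]
    if value in CONF_TO_MANAGE.values():   # already nova-manage spelling
        return value
    raise ValueError(f"unknown cell type: {value!r}")
```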
[Yahoo-eng-team] [Bug 1369518] [NEW] Server Group Anti/Affinity functionality doesn't work with cells
Public bug reported:

Server Groups don't work with cells. Tested in Icehouse.

Using the API, the server group is created in the top cell and not
propagated to the children cells. At this point booting a VM fails
because the schedulers in the children cells are not aware of the server
group. Creating the entries manually in the children cells' databases
avoids the instance scheduling failure; however, the anti/affinity
policy is not enforced correctly. Server group members are only updated
in the top cell. Schedulers in the children cells are not aware of the
members in the group (empty table in the children), so anti/affinity is
not respected.

** Affects: nova
   Importance: Wishlist
       Status: Confirmed

** Tags: cells

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1369518

Title:
  Server Group Anti/Affinity functionality doesn't work with cells

Status in OpenStack Compute (Nova):
  Confirmed

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1369518/+subscriptions
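[Editor's note] A tiny illustrative sketch of why the child-cell scheduler misbehaves: anti-affinity is evaluated against the group's member list, so a scheduler that sees an empty member table (as in the child cells described above) will happily co-locate instances. This is not nova's scheduler code, just the shape of the check.

```python
def violates_anti_affinity(candidate_host: str, member_hosts: list) -> bool:
    """True if placing an instance on candidate_host would break anti-affinity."""
    return candidate_host in member_hosts


# The top cell knows the group members; the child cell's table is empty,
# so the same placement passes the check there.
top_cell_view = ["compute-1"]
child_cell_view = []  # member rows were never propagated to the child
```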
[Yahoo-eng-team] [Bug 1334278] [NEW] limits with tenant parameter returns wrong maxTotal* values
Public bug reported:

When querying for the absolute limits of a specific tenant, the
maxTotal* values reported aren't correct.

How to reproduce (for example, using devstack):

OS_TENANT_NAME=demo (11b2b129994844798c98f437d9809a9c)
OS_USERNAME=demo

$ nova absolute-limits
+-------------------------+-------+
| Name                    | Value |
+-------------------------+-------+
| maxServerMeta           | 128   |
| maxPersonality          | 5     |
| maxImageMeta            | 128   |
| maxPersonalitySize      | 10240 |
| maxTotalRAMSize         | 1000  |
| maxSecurityGroupRules   | 20    |
| maxTotalKeypairs        | 100   |
| totalRAMUsed            | 128   |
| maxSecurityGroups       | 10    |
| totalFloatingIpsUsed    | 0     |
| totalInstancesUsed      | 2     |
| totalSecurityGroupsUsed | 1     |
| maxTotalFloatingIps     | 10    |
| maxTotalInstances       | 10    | <---
| totalCoresUsed          | 2     |
| maxTotalCores           | 10    | <---
+-------------------------+-------+

OS_TENANT_NAME=admin (b0f08277004b43aab516ae7dbf36ff51)
OS_USERNAME=admin

$ nova absolute-limits
+-------------------------+--------+
| Name                    | Value  |
+-------------------------+--------+
| maxServerMeta           | 128    |
| maxPersonality          | 5      |
| maxImageMeta            | 128    |
| maxPersonalitySize      | 10240  |
| maxTotalRAMSize         | 151200 |
| maxSecurityGroupRules   | 20     |
| maxTotalKeypairs        | 100    |
| totalRAMUsed            | 1152   |
| maxSecurityGroups       | 10     |
| totalFloatingIpsUsed    | 0      |
| totalInstancesUsed      | 18     |
| totalSecurityGroupsUsed | 1      |
| maxTotalFloatingIps     | 10     |
| maxTotalInstances       | 30     |
| totalCoresUsed          | 18     |
| maxTotalCores           | 30     |
+-------------------------+--------+

$ nova absolute-limits --tenant 11b2b129994844798c98f437d9809a9c
+-------------------------+--------+
| Name                    | Value  |
+-------------------------+--------+
| maxServerMeta           | 128    |
| maxPersonality          | 5      |
| maxImageMeta            | 128    |
| maxPersonalitySize      | 10240  |
| maxTotalRAMSize         | 151200 |
| maxSecurityGroupRules   | 20     |
| maxTotalKeypairs        | 100    |
| totalRAMUsed            | 128    |
| maxSecurityGroups       | 10     |
| totalFloatingIpsUsed    | 0      |
| totalInstancesUsed      | 2      |
| totalSecurityGroupsUsed | 1      |
| maxTotalFloatingIps     | 10     |
| maxTotalInstances       | 30     | <---
| totalCoresUsed          | 2      |
| maxTotalCores           | 30     | <---
+-------------------------+--------+

note: the arrows mark the wrong values. It seems that maxTotal* shows
the values for the current tenant and not for the tenant specified by
--tenant, as expected.

Tested in havana and icehouse-1.

** Affects: nova
   Importance: Undecided
       Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1334278

Title:
  limits with tenant parameter returns wrong maxTotal* values

Status in OpenStack Compute (Nova):
  New
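[Editor's note] The expected behaviour can be sketched self-contained: the maxTotal* values should come from the quota record of the tenant named in --tenant, not from the caller's own tenant. The dicts below stand in for the quota and usage tables; the key names mirror the CLI output above, and the function is illustrative, not nova's API code.

```python
from typing import Optional

QUOTAS = {
    "demo":  {"maxTotalInstances": 10, "maxTotalCores": 10},
    "admin": {"maxTotalInstances": 30, "maxTotalCores": 30},
}
USAGE = {
    "demo":  {"totalInstancesUsed": 2,  "totalCoresUsed": 2},
    "admin": {"totalInstancesUsed": 18, "totalCoresUsed": 18},
}


def absolute_limits(caller: str, tenant: Optional[str] = None) -> dict:
    """Report limits for `tenant` when given; the buggy code effectively
    ignored `tenant` for the maxTotal* values and used the caller's own."""
    target = tenant or caller
    return {**QUOTAS[target], **USAGE[target]}
```

With this lookup, admin asking about the demo tenant sees demo's maxTotalCores of 10, not admin's 30 as in the buggy output above.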
[Yahoo-eng-team] [Bug 1307223] [NEW] If target_cell path not valid instance stays in BUILD status
Public bug reported:

Using cells and the target_cell filter. With the scheduler hint
target_cell, if the path is not valid the instance stays in the
"scheduling" task state.

nova-cells shows the following trace:

2014-04-13 20:25:40.237 ERROR nova.cells.messaging [req-8bc1d2a7-92aa-48b6-afda-42f255e43904 demo demo] Error locating next hop for message: Inconsistency in cell routing: Unknown child when routing to region!other
2014-04-13 20:25:40.237 TRACE nova.cells.messaging Traceback (most recent call last):
2014-04-13 20:25:40.237 TRACE nova.cells.messaging   File "/opt/stack/nova/nova/cells/messaging.py", line 406, in process
2014-04-13 20:25:40.237 TRACE nova.cells.messaging     next_hop = self._get_next_hop()
2014-04-13 20:25:40.237 TRACE nova.cells.messaging   File "/opt/stack/nova/nova/cells/messaging.py", line 387, in _get_next_hop
2014-04-13 20:25:40.237 TRACE nova.cells.messaging     raise exception.CellRoutingInconsistency(reason=reason)
2014-04-13 20:25:40.237 TRACE nova.cells.messaging CellRoutingInconsistency: Inconsistency in cell routing: Unknown child when routing to region!other
2014-04-13 20:25:40.237 TRACE nova.cells.messaging

Expected: instance state changes to ERROR status.

** Affects: nova
   Importance: Undecided
       Status: New

** Tags: cells

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1307223

Title:
  If target_cell path not valid instance stays in BUILD status

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1307223/+subscriptions
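[Editor's note] The expected handling from the report above, sketched self-contained: if routing the build to the cell named in the target_cell hint fails, the instance should be moved to ERROR rather than left in BUILD/scheduling forever. The names below (schedule_to_cell, the dict-based instance) are illustrative, not nova's actual code; only the exception name comes from the trace.

```python
class CellRoutingInconsistency(Exception):
    """Mirrors the exception raised in nova/cells/messaging.py."""


def schedule_to_cell(instance: dict, target_cell: str, known_cells: set) -> dict:
    """Route the build; mark the instance ERROR when the cell path is invalid."""
    try:
        if target_cell not in known_cells:
            raise CellRoutingInconsistency(
                f"Unknown child when routing to {target_cell}")
        instance["vm_state"] = "active"
        instance["task_state"] = None
    except CellRoutingInconsistency:
        # The behaviour the reporter expects: surface the failure to the user
        # instead of leaving the instance stuck in "scheduling".
        instance["vm_state"] = "error"
        instance["task_state"] = None
    return instance
```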
[Yahoo-eng-team] [Bug 1286527] [NEW] Quota usages update should check all usage in tenant not only per user
Public bug reported:

After a Grizzly -> Havana upgrade, the quota_usages table was wiped out
due to bug #1245746. quota_usages is then updated after a user
creates/deletes an instance. The problem is that quota_usages is updated
per user in a tenant. For tenants that are shared by different users,
this means that users who hadn't previously created instances are able
to use the full quota of the tenant.

Example: instance quota for tenant_X = 10; user_a and user_b can create
instances in tenant_X.
- user_a creates 8 instances;
- user_b doesn't have instances;
- grizzly -> havana upgrade (quota_usages wipe);
- user_b is able to create 10 instances.

This is problematic for clouds that rely on tenant quotas and do not
bill users directly. Even though the previous example is associated with
bug #1245746, this can happen whenever a user's quota usage for a tenant
gets out of sync. Quota usages should be updated and synced considering
all resources in the tenant, not only the resources of the user that is
doing the request.

** Affects: nova
   Importance: Undecided
       Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1286527

Title:
  Quota usages update should check all usage in tenant not only per user

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1286527/+subscriptions
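[Editor's note] The fix the report argues for can be sketched in a few lines: a quota check should compare the request against usage summed over every user in the tenant, not just the requesting user's own rows in quota_usages. Plain dicts stand in for the database tables; the function name is illustrative.

```python
def tenant_headroom(quota: int, usage_by_user: dict) -> int:
    """Instances still available in the tenant, counting all users' usage."""
    return quota - sum(usage_by_user.values())


# The scenario from the report: quota 10, user_a already has 8 instances,
# and user_b's per-user usage row was wiped (or never existed).
usage = {"user_a": 8, "user_b": 0}
# Per-user accounting would let user_b create 10 more instances;
# per-tenant accounting correctly allows only 2.
```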
[Yahoo-eng-team] [Bug 1282709] [NEW] Instance names always include the first uuid in cell environment when creating multiple instances
Public bug reported:

When launching multiple instances using the nova API in a cell
environment (parent-child setup), the display_name always contains the
uuid of the first instance.

Example:
1) instance_name-uuid-1-uuid-1
2) instance_name-uuid-1-uuid-2
3) instance_name-uuid-1-uuid-3
4) instance_name-uuid-1-uuid-4

Expected:
1) instance_name-uuid-1
2) instance_name-uuid-2
3) instance_name-uuid-3
4) instance_name-uuid-4

How to reproduce:
* Have a cell environment (default devstack with cells enabled is enough)
* nova boot --image image_uuid --flavor flavor_name --num-instances 4 instance_name

** Affects: nova
   Importance: Undecided
   Assignee: Belmiro Moreira (moreira-belmiro-email-lists)
       Status: New

** Tags: cells

** Description changed:

- When launching multiple instances using nova api in a cell environment (parent-child setup)
+ When launching multiple instances using nova api in a cell environment (parent-child setup) the hostnames always have the uuid of the first instance.
  Example: 1) instance_name-uuid-1-uuid-1
- 2) instance_name-uuid-1-uuid-2
- 3) instance_name-uuid-1-uuid-3
- 4) instance_name-uuid-1-uuid-4
  Expected: 1) instance_name-uuid-1
- 2) instance_name-uuid-2
- 3) instance_name-uuid-3
- 4) instance_name-uuid-4
  How to reproduce:
- 1) Have cell environment (default devstack with cells enabled is enough)
- 2) nova boot --image image_uuid --flavor flavor_name --num-instances 4 instance_name
+ * Have cell environment (default devstack with cells enabled is enough)
+ * nova boot --image image_uuid --flavor flavor_name --num-instances 4 instance_name

** Description changed:

- When launching multiple instances using nova api in a cell environment (parent-child setup)
- the hostnames always have the uuid of the first instance.
+ When launching multiple instances using nova api in a cell environment
+ (parent-child setup) the display_name always have the uuid of the first
+ instance.

** Changed in: nova
     Assignee: (unassigned) => Belmiro Moreira (moreira-belmiro-email-lists)

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1282709

Title:
  Instance names always include the first uuid in cell environment when
  creating multiple instances

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1282709/+subscriptions
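[Editor's note] The expected naming from the report above, sketched self-contained: when booting N instances, each display name should carry that instance's own uuid rather than the first instance's uuid repeated. uuid4() stands in for however nova allocates instance ids; the function is illustrative, not nova's code.

```python
import uuid


def multi_boot_names(base_name: str, count: int) -> list:
    """One display name per instance, each suffixed with its own uuid."""
    return [f"{base_name}-{uuid.uuid4()}" for _ in range(count)]
```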
[Yahoo-eng-team] [Bug 1274169] [NEW] Nova libvirt driver uses the instance type ID instead the flavor ID when creating instances - problematic with cells
Public bug reported:

With cells, the same flavor has to be created manually in all available
cells using the nova API. If for some reason we need to delete a flavor
in a cell, the “instance_types” tables will then be out of sync
(different IDs for the same flavor). This blocks instance creation using
the libvirt driver, because the instance type ID of the flavor in the
top cell will be different from the one in the child. I would expect
nova to use the “flavor ID” that is defined by the admin instead of the
instance type ID.

** Affects: nova
   Importance: Undecided
       Status: New

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1274169

Title:
  Nova libvirt driver uses the instance type ID instead the flavor ID
  when creating instances - problematic with cells

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1274169/+subscriptions
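[Editor's note] A self-contained sketch of what the reporter asks for: identify flavors across cells by the admin-visible flavor id (the "flavorid" column) rather than the auto-increment primary key of each cell's instance_types table, which can drift between databases. The dicts below stand in for the two cells' tables; the lookup function is illustrative.

```python
# Same flavor, different auto-increment primary keys in each cell's database.
top_cell = [{"id": 7, "flavorid": "m1.small", "vcpus": 1}]
child_cell = [{"id": 9, "flavorid": "m1.small", "vcpus": 1}]


def find_flavor(instance_types: list, flavorid: str) -> dict:
    """Look up a flavor by the stable, admin-defined flavor id."""
    return next(f for f in instance_types if f["flavorid"] == flavorid)
```

Keyed on "flavorid", the lookup resolves to the same flavor in both cells even though the primary keys (7 vs 9) disagree.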
[Yahoo-eng-team] [Bug 1274325] [NEW] Security-groups not working with cells using nova-network
Public bug reported:

Security groups are not working with cells using nova-network. Only the
API cell database is updated when adding rules. These are not propagated
to the children cells.

** Affects: nova
   Importance: Undecided
       Status: New

** Tags: cells

** Description changed:

- Security groups are not working with cells using nova-network
+ Security groups are not working with cells using nova-network. Only cell API database is updated when adding rules. These are not propagated into the children cells.

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1274325

Title:
  Security-groups not working with cells using nova-network

Status in OpenStack Compute (Nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1274325/+subscriptions
[Yahoo-eng-team] [Bug 1164408] Re: Snapshot doesn't get hypervisor_type and vm_mode properties
** Changed in: nova
       Status: Triaged => Invalid

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1164408

Title:
  Snapshot doesn't get hypervisor_type and vm_mode properties

Status in OpenStack Compute (Nova):
  Invalid

Bug description:
  When a snapshot is created, it only gets some of the properties from
  the base image; in fact it only gets the architecture property (if
  defined). This is a problem when using the scheduler filter
  ImagePropertiesFilter, because it can also filter on the
  hypervisor_type and vm_mode properties. I believe we can assume that
  if the base image has requirements on architecture, hypervisor_type or
  vm_mode, the snapshot should have them too. I'm using the
  LibvirtDriver.

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1164408/+subscriptions
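[Editor's note] The behaviour the reporter expected, sketched self-contained: when creating a snapshot, carry over the image properties that the scheduler's image-properties filter matches on. The property names are the ones named in the report; the function itself is illustrative, not nova's snapshot code.

```python
# Properties the report says the scheduler can filter on.
SCHEDULING_PROPERTIES = ("architecture", "hypervisor_type", "vm_mode")


def snapshot_properties(base_image_properties: dict) -> dict:
    """Copy scheduler-relevant properties from the base image, if defined."""
    return {k: v for k, v in base_image_properties.items()
            if k in SCHEDULING_PROPERTIES}
```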