This was fixed in the queens nova package version 2:17.0.10-0ubuntu2~cloud0.

** Changed in: cloud-archive/queens
       Status: Fix Committed => Fix Released
** Changed in: cloud-archive/train
       Status: Fix Committed => Fix Released

** Changed in: nova (Ubuntu Eoan)
       Status: Fix Committed => Fix Released

--
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1821594

Title:
  [SRU] Error in confirm_migration leaves stale allocations and
  'confirming' migration state

Status in Ubuntu Cloud Archive:
  Fix Released
Status in Ubuntu Cloud Archive queens series:
  Fix Released
Status in Ubuntu Cloud Archive rocky series:
  Fix Released
Status in Ubuntu Cloud Archive stein series:
  Fix Released
Status in Ubuntu Cloud Archive train series:
  Fix Released
Status in OpenStack Compute (nova):
  Fix Released
Status in OpenStack Compute (nova) pike series:
  Triaged
Status in OpenStack Compute (nova) queens series:
  Fix Committed
Status in OpenStack Compute (nova) rocky series:
  Fix Committed
Status in OpenStack Compute (nova) stein series:
  Fix Released
Status in nova package in Ubuntu:
  Fix Released
Status in nova source package in Bionic:
  Fix Released
Status in nova source package in Cosmic:
  Fix Released
Status in nova source package in Disco:
  Fix Released
Status in nova source package in Eoan:
  Fix Released

Bug description:
  Description:

  When performing a cold migration, if an exception is raised by the
  driver during confirm_migration (which runs on the source node), the
  migration record is stuck in "confirming" state and the allocations
  against the source node are not removed. The instance is fine at the
  destination at this stage, but the source host holds allocations that
  cannot be cleaned up without going to the database or invoking the
  Placement API via curl. After several migration attempts that fail in
  the same spot, the source node fills up with these allocations, which
  prevent new instances from being created on, or migrated to, this
  node.
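  The symptom described above is that the source node ends up with
  allocation records whose consumer is a migration UUID with no live
  instance behind it. The following is a minimal illustrative sketch
  (all UUIDs, the helper name find_stale_consumers, and the data
  layout are made up for illustration; real data comes from the
  Placement API as queried in the [Test Case] section):

  ```python
  # Hypothetical sketch: after a failed confirm_migration, both the
  # instance UUID and the stuck migration UUID remain as allocation
  # consumers on the source resource provider. A stale consumer is one
  # that holds allocations but matches no existing instance.

  def find_stale_consumers(allocations_by_consumer, instance_uuids):
      """Return consumer UUIDs holding allocations that do not belong
      to any existing instance (e.g. stuck migration records)."""
      return sorted(set(allocations_by_consumer) - set(instance_uuids))

  # Made-up allocation data for the source resource provider.
  allocations = {
      "c2f51f50-0000-0000-0000-000000000001": {"VCPU": 2, "MEMORY_MB": 4096},  # instance
      "a9b3f2d4-0000-0000-0000-000000000002": {"VCPU": 2, "MEMORY_MB": 4096},  # stuck migration
  }
  instances = ["c2f51f50-0000-0000-0000-000000000001"]

  # The migration consumer that nova failed to clean up.
  print(find_stale_consumers(allocations, instances))
  ```

  Until the fix lands, entries found this way have to be deleted by
  hand against the Placement API, as shown in the [Test Case] cleanup
  step.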
  When confirm_migration fails at this stage, the migrating instance
  can be recovered through a hard reboot or a reset state to active.

  Steps to reproduce:

  Unfortunately, I don't have logs of the real root cause of the
  problem inside driver.confirm_migration running the libvirt driver.
  However, the stale allocations and migration status problem can be
  easily reproduced by raising an exception in the libvirt driver's
  confirm_migration method, and it would affect any driver.

  Expected results:

  Discussed this issue with efried and mriedem on #openstack-nova on
  March 25th, 2019. They confirmed that allocations not being cleaned
  up is a bug.

  Actual results:

  Instance is fine at the destination after a reset-state. Source node
  has stale allocations that prevent new instances from being created
  on or migrated to the source node. Migration record is stuck in
  "confirming" state.

  Environment: I verified this bug on the pike, queens and stein
  branches, running the libvirt KVM driver.

  =======================================================================

  [Impact]

  If users attempting to perform cold migrations hit any issue while
  the virt driver is running the "Confirm Migration" step, the failure
  leaves stale allocation records in the database, along with migration
  records stuck in "confirming" state. The stale allocations are not
  cleaned up by nova and consume the user's quota indefinitely.

  This bug was confirmed from the pike to the stein release, and a fix
  was implemented for queens, rocky and stein. It should be backported
  to those releases to prevent the issue from reoccurring. The fix
  prevents new stale allocations from being left over by cleaning them
  up immediately when the failure occurs. At the moment, users affected
  by this bug have to clean up their existing stale allocations
  manually.

  [Test Case]

  1. Reproducing the bug

  1a.
  Inject failure

  The root cause of this problem may vary for each driver and
  environment, so to reproduce the bug it is first necessary to inject
  a failure into the driver's confirm_migration method to cause an
  exception to be raised. An example when using libvirt is to add the
  line:

    raise Exception("TEST")

  in
  https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012

  1b. Restart the nova-compute service: systemctl restart nova-compute
  1c. Create a VM
  1d. Invoke a cold migration: openstack server migrate {id}
  1e. Wait for instance status: VERIFY_RESIZE
  1f. Invoke: openstack server resize {id} --confirm
  1g. Wait for instance status: ERROR
  1h. Check that the migration is stuck in "confirming" status:
      nova migration-list
  1i. Check allocations. You should see 2 allocations: one with the VM
      ID, the other with the migration uuid.

  export ENDPOINT=<placement_endpoint>
  export TOKEN=`openstack token issue | grep ' id ' | awk '{print $4}'`
  for id in $(curl -k -s -X GET $ENDPOINT/resource_providers -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq -r .resource_providers[].uuid); do
    curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq [.allocations]
  done

  2. Cleanup

  2a. Delete the VM
  2b. Delete the stale allocation:

  export ID=<migration_uuid>
  curl -k -s -X DELETE $ENDPOINT/allocations/$ID -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17"

  3. Install the package that contains the fixed code

  4. Confirm the bug is fixed

  4a. Repeat steps 1a through 1g
  4b. Check that the migration has "error" status: nova migration-list
  4c.
  Check allocations. You should see only 1 allocation, with the VM ID.

  for id in $(curl -k -s -X GET $ENDPOINT/resource_providers -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq -r .resource_providers[].uuid); do
    curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq [.allocations]
  done

  5. Cleanup

  5a. Delete the VM

  [Regression Potential]

  The new functional test https://review.opendev.org/#/c/657870/
  validated the fix and was backported all the way to Queens. The fix
  being backported caused no functional test to fail.

  [Other Info]

  None

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1821594/+subscriptions

--
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : yahoo-eng-team@lists.launchpad.net
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp