Public bug reported:

I noticed this while working on a functional test to recreate a bug during resize reschedule:

https://review.opendev.org/#/c/686017/

And discussed a bit in IRC:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-10-01.log.html#t2019-10-01T16:33:27

The issue is that we can start a resize (or cold migration) of a stopped or active (normally active) server and fail a resize claim in the compute service, either due to some race or for resource claims that are not handled by placement yet, like NUMA and PCI devices:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4527

That ResourceTracker.resize_claim can raise ComputeResourcesUnavailable, which is handled here:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4610

We may try to reschedule, but if rescheduling fails, or we don't reschedule, the instance is set to error state by this context manager:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4592

That will set the instance vm_state to error:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L8809

If we failed a resize claim, there is actually no change in the guest, the same as if we failed a cold migration because the scheduler selected the same host and the virt driver does not support that, see:

https://github.com/openstack/nova/blob/4d18b29c95e3862c68ab41a4c090eb30c32a037a/nova/compute/manager.py#L4489

If _prep_resize raises InstanceFaultRollback, _error_out_instance_on_exception handles it differently since https://review.opendev.org/#/c/633212/: it does not put the instance into ERROR state but reverts the vm_state to its previous value (active or stopped).
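
To make the two outcomes described above concrete, here is a minimal, self-contained sketch of how a context manager of this kind can either error out an instance or roll its vm_state back. It is a simplified model, not the nova code: the Instance class, the prep_resize helper, its claim_ok and wrap_in_rollback parameters, and the exception classes are stand-ins defined here for illustration, and the real method takes different arguments and persists the instance.

```python
import contextlib


class ComputeResourcesUnavailable(Exception):
    """Stand-in for the exception a failed resize claim raises."""


class InstanceFaultRollback(Exception):
    """Stand-in wrapper that asks the handler to roll back, not error out."""

    def __init__(self, inner_exception):
        super().__init__(str(inner_exception))
        self.inner_exception = inner_exception


class Instance:
    """Hypothetical minimal instance with only the fields used here."""

    def __init__(self, vm_state):
        self.vm_state = vm_state
        self.task_state = 'resize_prep'


@contextlib.contextmanager
def error_out_instance_on_exception(instance):
    """Simplified model of the error handling described in the report."""
    original_vm_state = instance.vm_state
    try:
        yield
    except InstanceFaultRollback as error:
        # Guest was not changed: restore the previous vm_state
        # (active or stopped) and surface the wrapped failure.
        instance.vm_state = original_vm_state
        instance.task_state = None
        raise error.inner_exception
    except Exception:
        # Any other failure leaves the instance in ERROR.
        instance.vm_state = 'error'
        instance.task_state = None
        raise


def prep_resize(instance, claim_ok, wrap_in_rollback):
    """Hypothetical helper: simulate a resize claim that may fail."""
    with error_out_instance_on_exception(instance):
        if not claim_ok:
            failure = ComputeResourcesUnavailable('resize claim failed')
            if wrap_in_rollback:
                raise InstanceFaultRollback(failure)
            raise failure
```

In this model, a claim failure raised directly leaves an active instance with vm_state 'error', while the same failure wrapped in InstanceFaultRollback leaves it 'active'. In both cases the caller still sees ComputeResourcesUnavailable, so the failure is not hidden from whatever records the fault.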

If the guest is not changed, I don't think the instance should be in ERROR status because of a resize claim failure, but opinions on that differ, e.g.:

(11:40:45 AM) mriedem: dansmith: ok, but still, the user shouldn't have to stop and then start to get out of that, or hard reboot, when the thing that failed is a resize claim race
(11:41:03 AM) dansmith: mriedem: so maybe it's just stop I'm thinking of.. anyway, I dunno.. it's very annoying as a user to do something, come back later and have it not obvious that the thing has happened, or failed or whatever
(11:41:52 AM) dansmith: mriedem: if you're going to retry the operation for them, I agree. if you're not, then being super obvious about what has happened is best, IMHO

If we aren't going to handle the resize claim failure automatically and avoid setting the instance to error state, then we should at least have something in the API reference documentation about post-conditions for the resize and cold migrate actions, such that if the instance is in ERROR state and there is a fault for the resize claim failure, the user can stop/start or hard reboot the server to reset its status. I do think we have some precedent for handling non-error conditions like this, though, since https://review.opendev.org/#/c/633227/.

This is latent behavior so I'm going to mark it low priority, but I wanted to make sure we have a bug reported for it.

** Affects: nova
     Importance: Low
         Status: Triaged

** Tags: resize

-- 
You received this bug notification because you are a member of Yahoo! Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1846262

Title:
  Failed resize claim leaves otherwise active instance in ERROR state

Status in OpenStack Compute (nova):
  Triaged