Public bug reported:

I'm facing an issue similar to "https://bugs.launchpad.net/nova/+bug/1918419", but different enough that I'm opening a new bug.

Some context to better explain how this affects operations:

When a compute node needs a hardware intervention, the repair team (who don't have access to the OpenStack APIs) uses an automated process to live migrate all the instances before starting the repair. The motivation is to minimize the impact on users.

However, instances can't be live migrated if the compute node becomes overcommitted. If a DIMM fails in a compute node that has all of its memory allocated to VMs, it's not possible to move those VMs:

"No valid host was found. Unable to replace instance claim on source (HTTP 400)"

The compute node becomes overcommitted (because the failed DIMM is no longer visible) and placement can't create the migration allocation on the source.

The operator can work around this by "tuning" the memory overcommit for the affected compute node, but that requires investigation and manual intervention by an operator, which defeats automation and delegation to other teams. This is extremely complicated in large deployments.

I don't believe this behaviour is correct. If there are available resources to host the instances on a different compute node, placement shouldn't block the live migration because the source is overcommitted.

+++

Using Nova Stein. From what I checked, it looks like this is still the behaviour in recent releases.

** Affects: nova
     Importance: Undecided
         Status: New

https://bugs.launchpad.net/bugs/1924123

Title:
  If source compute node is overcommitted instances can't be migrated

Status in OpenStack Compute (nova):
  New
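
For illustration only, here is a minimal sketch of the capacity arithmetic that appears to be behind the failure. It assumes placement's documented capacity formula, (total - reserved) * allocation_ratio, and models the move of the instance's source allocation to the migration record as a fresh request against the shrunken inventory. The Inventory class, the helper names, and all the numbers (eight 32 GB DIMMs, 4 GB reserved, an 8 GB instance) are hypothetical and are not Nova or placement code.

```python
from dataclasses import dataclass


@dataclass
class Inventory:
    """MEMORY_MB inventory of one compute node (hypothetical values)."""
    total: int               # MB reported by the hypervisor
    reserved: int            # MB held back for the host itself
    allocation_ratio: float  # memory overcommit ratio

    @property
    def capacity(self) -> float:
        # Capacity formula documented for placement inventories.
        return (self.total - self.reserved) * self.allocation_ratio


def fits(inv: Inventory, used: int, requested: int) -> bool:
    """True if `requested` MB can be allocated on top of `used` MB."""
    return used + requested <= inv.capacity


def min_ratio_to_fit(inv: Inventory, used: int) -> float:
    """Smallest allocation ratio at which `used` MB no longer exceeds capacity."""
    return used / (inv.total - inv.reserved)


# Healthy node: 8 x 32 GB DIMMs, fully allocated to VMs, no overcommit.
healthy = Inventory(total=8 * 32768, reserved=4096, allocation_ratio=1.0)
# Same node after one DIMM fails and disappears from the hypervisor.
degraded = Inventory(total=7 * 32768, reserved=4096, allocation_ratio=1.0)

all_vms = 258048   # MB allocated to all instances on the node
instance = 8192    # MB held by the instance being live migrated
others = all_vms - instance

# Re-creating the instance's source allocation under the migration record:
ok_before = fits(healthy, others, instance)
ok_after = fits(degraded, others, instance)
needed_ratio = min_ratio_to_fit(degraded, all_vms)

print(ok_before, ok_after, round(needed_ratio, 3))
```

Under these assumptions the same allocation that fit on the healthy node no longer fits once the DIMM is gone, even though nothing new is being placed on the source. The last value is the overcommit ratio an operator would have to set by hand on the affected node to unblock the migration, which is the manual step described above.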