[Yahoo-eng-team] [Bug 2054329] [NEW] orphan allocations cause orphan resource providers and prevents compute service deletion

Robert Franzke Mon, 19 Feb 2024 08:44:22 -0800

Public bug reported:

Description
===========
Orphan allocations can exist against a resource provider, e.g. when something
went wrong during a migration.


During the deletion of a nova-compute-service, the nova-api also tries to
delete the resource-provider in placement.
When the resource provider still has allocations against it, the deletion of
the resource-provider fails, but the deletion of the nova-compute-service
succeeds.
This leaves an orphan resource-provider behind.

This is caused by the try/except around the deletion of the resource-provider:
https://opendev.org/openstack/nova/src/commit/6e510eb62e00c34e98a5245a6de2dd2955ffb57a/nova/api/openstack/compute/services.py#L321
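The effect of that error handling can be illustrated with a small toy model.
This is only a sketch and not nova's actual code: `ToyPlacement`,
`ProviderInUse` and `delete_compute_service` are hypothetical names, and the
services and providers are plain in-memory dicts.

```python
class ProviderInUse(Exception):
    """Hypothetical error: the provider still has allocations against it."""


class ToyPlacement:
    """In-memory stand-in for placement (assumption, not the real API)."""

    def __init__(self):
        self.providers = {}    # provider UUID -> hostname
        self.allocations = {}  # provider UUID -> set of consumer UUIDs

    def delete_provider(self, rp_uuid):
        # Refuse the deletion while allocations still reference the provider.
        if self.allocations.get(rp_uuid):
            raise ProviderInUse(rp_uuid)
        self.providers.pop(rp_uuid, None)


def delete_compute_service(services, placement, host):
    """Delete the service record; a failed provider cleanup is only logged."""
    rp_uuid = services[host]
    try:
        placement.delete_provider(rp_uuid)
    except ProviderInUse as exc:
        print(f"could not delete resource provider {exc}; continuing")
    # The service is removed regardless of the outcome above.
    del services[host]


placement = ToyPlacement()
placement.providers["rp-1"] = "compute-01"
placement.allocations["rp-1"] = {"orphan-consumer"}
services = {"compute-01": "rp-1"}

delete_compute_service(services, placement, "compute-01")

# Providers whose host no longer has a service are orphans.
orphaned = [rp for rp, host in placement.providers.items()
            if host not in services]
print(orphaned)
```

In this model the service is gone after the call while the provider survives,
which is the orphan state described above.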

If a new nova-compute-service with the same hostname gets created, it will not 
create a new resource provider as there is already one with the correct 
hostname.
This causes a mismatch between the ID of the nova-compute-service and the ID of 
the resource-provider.

If you now try to delete the new nova-compute-service, it raises a
'ValueError' due to this mismatch.
The same failure affects all other requests to placement that reference the
resource_provider by UUID instead of by name.
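The mismatch can be sketched in the same toy style. Again, this is an
illustration under assumptions and not nova or placement code: `ToyPlacement`,
`start_compute` and the host name `compute-01` are made up, and the sketch
assumes that the recreated service gets a fresh UUID while the orphan
provider keeps its old one.

```python
import uuid


class ToyPlacement:
    """In-memory stand-in for placement (assumption, not the real API)."""

    def __init__(self):
        self.by_uuid = {}  # provider UUID -> hostname

    def find_by_name(self, name):
        for rp_uuid, rp_name in self.by_uuid.items():
            if rp_name == name:
                return rp_uuid
        return None

    def get_by_uuid(self, rp_uuid):
        if rp_uuid not in self.by_uuid:
            raise ValueError(f"no resource provider with UUID {rp_uuid}")
        return self.by_uuid[rp_uuid]


def start_compute(placement, host):
    """Return a fresh service UUID; create a provider only if the name is free."""
    new_uuid = str(uuid.uuid4())
    if placement.find_by_name(host) is None:
        placement.by_uuid[new_uuid] = host
    return new_uuid


placement = ToyPlacement()
old_uuid = str(uuid.uuid4())
placement.by_uuid[old_uuid] = "compute-01"  # orphan left by the earlier deletion

new_uuid = start_compute(placement, "compute-01")

# The lookup by name still resolves to the old provider ...
print(placement.find_by_name("compute-01") == old_uuid)

# ... but the lookup by the new service's UUID fails.
try:
    placement.get_by_uuid(new_uuid)
except ValueError as exc:
    lookup_error = str(exc)
    print(lookup_error)
```

Every code path that addresses the provider by the new UUID hits the failing
branch, while name-based lookups keep returning the stale provider.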

Steps to reproduce
==================
1. Generate orphaned allocations on a resource provider
This can be done by creating an allocation for a random UUID:
```
openstack resource provider allocation set <random-uuid> \
  --allocation="rp=<your-resource-provider-id>,VCPU=2" \
  --project-id <your-project-id> --user-id <your-user-id>
```
2. Delete the nova-compute-service via the nova-api
3. Restart the nova-compute service, so a new nova-compute-service is created
4. You will start to see errors in the logs of placement/nova-api about the
resource provider with the old UUID not being found
5. Delete the nova-compute-service via the nova-api. This returns a 500
error and the nova-compute-service is not deleted.
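For repeated test runs, the command from step 1 can be assembled with a
freshly generated UUID. The helper below is only a convenience sketch:
`build_orphan_allocation_cmd` is a made-up name, and the placeholder IDs have
to be replaced with real values from your environment.

```python
import shlex
import uuid


def build_orphan_allocation_cmd(rp_uuid, project_id, user_id,
                                consumer_uuid=None):
    """Build the argv for the allocation command from step 1.

    A random UUID is generated when no consumer_uuid is passed in.
    """
    consumer = consumer_uuid or str(uuid.uuid4())
    return [
        "openstack", "resource", "provider", "allocation", "set", consumer,
        f"--allocation=rp={rp_uuid},VCPU=2",
        "--project-id", project_id,
        "--user-id", user_id,
    ]


cmd = build_orphan_allocation_cmd(
    "<your-resource-provider-id>", "<your-project-id>", "<your-user-id>")

# Print a shell-quoted form for review before running anything.
print(shlex.join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the
placeholders are filled in. Building an argv list instead of a single string
avoids shell quoting problems with the `--allocation` value.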

Expected result
===============
No errors in the logs about a resource-provider not being found by its ID.
The deletion of the recreated nova-compute-service should be successful.

Actual result
=============
We see errors in the log about the resource provider not being found:
```
An error occurred while updating COMPUTE_STATUS_DISABLED trait on compute node 
resource provider d5d7cf1c-51ea-4139-9fc3-6007ba58441e. The trait will be 
synchronized when the update_available_resource periodic task runs. Error: 
Failed to get traits for resource provider with UUID 
d5d7cf1c-51ea-4139-9fc3-6007ba58441e
```
We are not able to delete the newly created nova-compute-service: the request
fails with a ValueError because the resource-provider cannot be found by the
nova-compute-service UUID.

Environment
===========
We are running OpenStack Zed, but based on the code the issue should still be
present on the master branch.

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2054329

Title:
  orphan allocations cause orphan resource providers and prevents
  compute service deletion

Status in OpenStack Compute (nova):
  New

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/2054329/+subscriptions


-- 
Mailing list: https://launchpad.net/~yahoo-eng-team
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~yahoo-eng-team
More help   : https://help.launchpad.net/ListHelp
