Public bug reported:

https://review.openstack.org/#/c/560459/ in Rocky changed the libvirt
driver such that if the compute node provider is in a shared storage
provider aggregate relationship (i.e. in the same aggregate as a
resource provider that has DISK_GB inventory and the
MISC_SHARES_VIA_AGGREGATE trait), the compute node provider no longer
reports DISK_GB inventory.
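
For illustration (all names and values below are made up, not taken from
a real deployment), that layout means a server's allocations in
placement end up spread across two providers, roughly:

    # Made-up example of the provider layout after that change:
    #
    #   compute node RP   -> inventory: VCPU, MEMORY_MB (no DISK_GB)
    #   shared storage RP -> inventory: DISK_GB, has the
    #                        MISC_SHARES_VIA_AGGREGATE trait, and is in
    #                        the same placement aggregate as the compute
    #                        node RP
    #
    # so a server's allocations look something like:
    allocations = {
        'compute-node-rp-uuid': {
            'resources': {'VCPU': 2, 'MEMORY_MB': 4096}},
        'shared-storage-rp-uuid': {
            'resources': {'DISK_GB': 20}},
    }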

There are at least two major issues with this:

1. On upgrade from Queens, any existing allocations against the compute
node provider's DISK_GB inventory will prevent that inventory from being
removed from the compute node provider during the
update_available_resource periodic task. In other words, we have no data
migration routine in place in Rocky to move DISK_GB allocations from the
compute node provider to the shared storage provider (a rough sketch of
what such a migration would involve is included after issue 2 below).

2. During a move operation, we move the instance's allocations from the
source compute node provider to the migration record, then go through
the scheduler to pick a dest host for the instance and allocate
resources against the dest host (and, optionally, the shared storage
provider). So:

a) The DISK_GB allocation from the instance to the shared storage
provider is deleted for a short window of time during scheduling until
we pick a dest host.

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/tasks/migrate.py#L57

b) If a cold migrate fails or is reverted, we delete the allocations
created by the scheduler and move the allocations held by the migration
record (against the source node provider) back to the instance. But
because we never moved the instance's DISK_GB allocation against the
sharing provider to the migration record, that DISK_GB allocation is
lost when the allocations are copied back to the instance on
revert/failure (illustrated below):

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/compute/manager.py#L4155
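
To make (a) and (b) concrete, here is a rough illustration with made-up
providers and values (this is just the shape of the data, not actual
nova code):

    # Made-up data illustrating (a) and (b); not actual nova code.

    # What the instance holds before the move operation starts:
    instance_allocs = {
        'source-compute-rp': {
            'resources': {'VCPU': 2, 'MEMORY_MB': 4096}},
        'shared-storage-rp': {
            'resources': {'DISK_GB': 20}},
    }

    # What currently gets moved to the migration record consumer: only
    # the source compute node provider's resources; the sharing
    # provider's DISK_GB allocation is dropped (a).
    migration_allocs = {
        'source-compute-rp': {
            'resources': {'VCPU': 2, 'MEMORY_MB': 4096}},
    }

    # On revert/failure the migration record's allocations are copied
    # back to the instance, so the DISK_GB allocation against the
    # sharing provider is lost (b).
    reverted_instance_allocs = migration_allocs
    assert 'shared-storage-rp' not in reverted_instance_allocs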

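Back to issue 1: a data migration for the upgrade case would have to
rewrite each affected consumer's allocations, moving DISK_GB from the
compute node provider to the sharing provider. A minimal sketch of the
idea, assuming a simple placement REST client exposing get()/put()
helpers (the client and helper names here are hypothetical, and
consumer-generation conflict/retry handling is omitted):

    # Hypothetical sketch only, not a proposed patch.
    def migrate_disk_allocation(placement, consumer_uuid,
                                cn_rp_uuid, shared_rp_uuid):
        body = placement.get('/allocations/%s' % consumer_uuid,
                             version='1.28').json()
        allocs = body['allocations']
        disk = allocs.get(cn_rp_uuid, {}).get(
            'resources', {}).pop('DISK_GB', None)
        if disk is None:
            # No DISK_GB held against the compute node provider.
            return
        # Merge the DISK_GB amount onto the sharing provider.
        shared = allocs.setdefault(shared_rp_uuid, {'resources': {}})
        shared['resources']['DISK_GB'] = (
            shared['resources'].get('DISK_GB', 0) + disk)
        if not allocs[cn_rp_uuid]['resources']:
            del allocs[cn_rp_uuid]
        placement.put(
            '/allocations/%s' % consumer_uuid,
            {'allocations': {rp: {'resources': alloc['resources']}
                             for rp, alloc in allocs.items()},
             'project_id': body['project_id'],
             'user_id': body['user_id'],
             'consumer_generation': body['consumer_generation']},
            version='1.28')
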
--

We could also have issues with forced live migration:

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/tasks/live_migrate.py#L109

and with evacuate:

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L868

both of which bypass the scheduler altogether, so we're potentially not
handling shared provider allocations there either.
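
If those paths do need fixing, the allocation copying there would
presumably have to treat sharing providers specially, along these lines
(purely a sketch with made-up function and variable names, not how the
conductor code is actually structured):

    # Purely a sketch with made-up names: when copying a source host's
    # allocations to a destination host without going through the
    # scheduler (forced live migration, evacuate), resources held
    # against a sharing provider should stay on that provider instead
    # of being dropped or re-targeted at the destination compute node
    # provider.
    def build_dest_allocations(source_allocs, source_cn_rp_uuid,
                               dest_cn_rp_uuid, sharing_rp_uuids):
        dest_allocs = {}
        for rp_uuid, alloc in source_allocs.items():
            if rp_uuid == source_cn_rp_uuid:
                # Compute-node-local resources (VCPU, MEMORY_MB, ...)
                # move to the destination compute node provider.
                dest_allocs[dest_cn_rp_uuid] = alloc
            elif rp_uuid in sharing_rp_uuids:
                # e.g. DISK_GB on the shared storage provider stays put.
                dest_allocs[rp_uuid] = alloc
            else:
                # Other (e.g. nested) providers are out of scope here;
                # keep them as-is.
                dest_allocs[rp_uuid] = alloc
        return dest_allocs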

Also, we don't have *any* shared storage provider CI jobs set up. A
start to that is here:

https://review.openstack.org/#/c/586363/

But that's just a single-node job at the moment and we'd need a multi-
node shared storage CI job to really say we support shared storage
providers as a feature in nova.

** Affects: nova
     Importance: High
         Status: Triaged


** Tags: libvirt placement rocky-rc-potential shared-storage

https://bugs.launchpad.net/bugs/1784020

Title:
  Shared storage providers are not supported and will break things if
  used

Status in OpenStack Compute (nova):
  Triaged
