[Yahoo-eng-team] [Bug 1908133] [NEW] Nova does not track shared ceph pools across multiple nodes

Rodrigo Barbieri Mon, 14 Dec 2020 13:11:12 -0800

Public bug reported:

Environment:
- tested in focal-victoria and bionic-stein


======================

Steps to reproduce:
1) Deploy OpenStack with 2 nova-compute nodes
2) Configure both compute nodes to have an RBD backend pointing to the same pool 
in ceph as below:

[libvirt]
images_type = rbd
images_rbd_pool = nova

3) Run "openstack hypervisor show" on each node. Both will show the full
pool capacity:

local_gb             | 29
local_gb_used        | 0
free_disk_gb         | 29
disk_available_least | 15

4) Create a 20GB instance and run "openstack hypervisor show" again on
the node where it landed:

local_gb             | 29
local_gb_used        | 20
free_disk_gb         | 9
disk_available_least | 15

5) Create another 20GB instance. It will land on the other hypervisor.
6) Try to create a third 20GB instance. It will fail because placement will not 
return an allocation candidate. This is correct.
7) SSH into both instances and fill their disks (based on the 
disk_available_least value, which is read from ceph df, only one may need to be 
filled).
8) I/O for all instances freezes as the ceph pool runs out of space, and the 
nova-compute service hangs on "create_image" whenever creation of a new 
instance is attempted there, causing the service to be reported as "down".
9) disk_available_least is updated to 0, but that does not prevent new 
instances from being scheduled.

This is the first problem: both compute nodes track "free_disk_gb" and
"local_gb_used" independently of the ceph pool, while
"disk_available_least" is not used by the scheduler to prevent the
problem when disk_allocation_ratio is 1.0 (it is used appropriately by
live-migration, though).
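
The over-commit described above comes down to simple arithmetic. The
following is a minimal sketch, not nova code: the ComputeNode class and
the claim and schedule helpers are hypothetical names used only for
illustration, and the numbers come from the steps above (a 29 GB pool
and 20 GB instances).

```python
from dataclasses import dataclass

POOL_GB = 29  # capacity of the shared ceph pool, as reported in step 3


@dataclass
class ComputeNode:
    """Hypothetical model of one node's independent disk accounting."""
    name: str
    local_gb: int = POOL_GB  # each node reports the whole pool as its own
    local_gb_used: int = 0

    @property
    def free_disk_gb(self) -> int:
        return self.local_gb - self.local_gb_used

    def claim(self, disk_gb: int) -> bool:
        """Accept the instance if this node's own accounting has room."""
        if disk_gb > self.free_disk_gb:
            return False
        self.local_gb_used += disk_gb
        return True


nodes = [ComputeNode("compute1"), ComputeNode("compute2")]


def schedule(disk_gb: int) -> bool:
    """Place the instance on the first node that accepts it."""
    return any(node.claim(disk_gb) for node in nodes)


# Steps 4 to 6: two 20 GB instances are accepted, the third is rejected.
results = [schedule(20) for _ in range(3)]

# Disk promised across both nodes, versus what the pool can actually hold.
promised_gb = sum(node.local_gb_used for node in nodes)

print(results)
print(f"promised {promised_gb} GB against a {POOL_GB} GB pool")
```

Each node stays within its own limit, yet the two nodes together
promise more disk than the pool holds, which is why filling the
instances in step 7 exhausts ceph.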

Alternatively, as a possible solution/fix/workaround, I followed [0]
and [1] to make placement the centralized place for tracking the
shared ceph pool. I ran the following steps:

10) openstack resource provider create ceph_nova_pool

11) openstack resource provider inventory set --os-placement-api-version
1.19 --resource DISK_GB=30 <ceph_nova_pool_uuid>

12) openstack resource provider trait set --os-placement-api-version
1.19 <ceph_nova_pool_uuid> --trait MISC_SHARES_VIA_AGGREGATE

13) openstack resource provider aggregate set <ceph_nova_pool_uuid>
--aggregate <resource_provider1_uuid> --aggregate
<resource_provider2_uuid> --generation 2 --os-placement-api-version 1.19

14) Deleted all instances and repeated steps 4, 5 and 6, with the same result

15) openstack resource provider set --name <resource_provider1_name>
--parent-provider <ceph_nova_pool_uuid> <resource_provider1_uuid>
--os-placement-api-version 1.19

16) openstack resource provider set --name <resource_provider2_name>
--parent-provider <ceph_nova_pool_uuid> <resource_provider2_uuid>
--os-placement-api-version 1.19

17) Deleted all instances and repeated steps 4, 5 and 6. Now I was able
to create 3 instances, where 1 of them had allocations from the
ceph_nova_pool resource provider. The created resource_provider is being
treated as an "extra" resource provider.

18) Deleted 2 instances that had allocations from the compute nodes

19) openstack resource provider inventory delete
<resource_provider1_uuid> --resource-class DISK_GB

20) openstack resource provider inventory delete
<resource_provider2_uuid> --resource-class DISK_GB

21) watch openstack allocation candidate list --resource DISK_GB=20
--os-placement-api-version 1.19

The list is now empty, until nova-compute's periodic update restores
the inventory with its local_gb value and we go back to the state at
step 17.
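
One rough way to see why the manually created provider behaves as
"extra" capacity is to count DISK_GB per provider. The sketch below is
a simplification under stated assumptions, not placement's actual
algorithm: the candidates and allocate helpers are hypothetical, only
DISK_GB is modeled, and allocation ratios and reserved values are
ignored.

```python
# Hypothetical DISK_GB inventories after step 16: both compute nodes still
# report the whole pool, and the manually created provider adds 30 GB more.
inventory = {"compute1": 29, "compute2": 29, "ceph_nova_pool": 30}
used = {}


def candidates(disk_gb, inventory, used):
    """Return providers with enough unused DISK_GB for the request."""
    return [name for name, total in inventory.items()
            if total - used.get(name, 0) >= disk_gb]


def allocate(disk_gb, inventory, used):
    """Allocate from the first candidate, or return None if there is none."""
    found = candidates(disk_gb, inventory, used)
    if not found:
        return None
    used[found[0]] = used.get(found[0], 0) + disk_gb
    return found[0]


# Step 17: three 20 GB instances fit, one per provider.
placed = [allocate(20, inventory, used) for _ in range(3)]

# Step 18: delete the two instances allocated from the compute nodes.
used["compute1"] = used["compute2"] = 0

# Steps 19 and 20: delete the compute nodes' DISK_GB inventories.
saved = {name: inventory.pop(name) for name in ("compute1", "compute2")}
after_delete = candidates(20, inventory, used)

# The periodic update from nova-compute puts the inventories back.
inventory.update(saved)
after_periodic = candidates(20, inventory, used)

print(placed, after_delete, after_periodic)
```

In this model the pool provider has only 10 GB left once the compute
inventories are removed, so no candidate exists until the periodic
update re-creates them.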


======================

Expected result:
- For the first approach, scheduling is expected to take the 
disk_available_least value into account (together with disk_allocation_ratio) 
so that instances cannot be created when there is no space.
- For the second approach, there should be a way to prevent nova-compute from 
periodically updating a specific inventory, or to guarantee that its inventory 
is shared with another resource_provider instead of being treated as an 
"extra" one.
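
As an illustration of the first expectation, the check below shows one
possible way a scheduler could combine disk_available_least with
disk_allocation_ratio. It is a sketch of the requested behaviour, not
existing nova code: the function name fits_on_host and its formula are
assumptions made for this example.

```python
def fits_on_host(requested_gb, local_gb, disk_available_least,
                 disk_allocation_ratio=1.0):
    """Hypothetical check: does the request fit in the real free space?

    Space already consumed is derived from disk_available_least (the value
    read from ceph), and the allocation ratio scales the total capacity.
    """
    consumed_gb = local_gb - disk_available_least
    limit_gb = local_gb * disk_allocation_ratio
    return requested_gb <= limit_gb - consumed_gb


# Values from step 3: 29 GB total, 15 GB actually available in the pool.
print(fits_on_host(20, local_gb=29, disk_available_least=15))
# After step 9 the pool is full, so nothing should be scheduled.
print(fits_on_host(1, local_gb=29, disk_available_least=0))
# With over-commit allowed, the same request is accepted.
print(fits_on_host(20, local_gb=29, disk_available_least=15,
                   disk_allocation_ratio=2.0))
```

With a ratio of 1.0 the first two calls reject the request, which is
the behaviour the report asks for once the pool is exhausted.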


[0] https://github.com/openstack/placement/blob/c02a073c523d363d7136677ab12884dc4ec03e6f/placement/objects/research_context.py#L1107
[1] https://docs.openstack.org/placement/latest/user/provider-tree.html

** Affects: nova
     Importance: Undecided
         Status: New

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/1908133
