[Yahoo-eng-team] [Bug 2102034] [NEW] Some instance artefacts remain after evacuation

Mitya Eremeev Tue, 11 Mar 2025 09:21:49 -0700

Public bug reported:

1) Deploy several VMs on one hypervisor, e.g. 25
for i in {1..25}; do openstack --os-compute-api-version 2.74 server create \
  --flavor m1.tiny --image cirros-0.6.3-x86_64-disk --network private1 \
  --host comp-01 vm_$i; done


2) Crash the hypervisor that hosts the VMs
echo c > /proc/sysrq-trigger

3) Wait until the compute service is "down"
openstack compute service list --service nova-compute --long

4) Hard reboot the failed hypervisor (via Horizon) and immediately evacuate
all VMs
for i in {1..25}; do openstack server evacuate vm_$i; done

5) Wait until all VMs are evacuated
openstack server list --long

6) Wait until the compute service is "up"
openstack compute service list --service nova-compute --long

7) Check for artefacts of the evacuated VMs on the failed hypervisor, e.g.
ls /var/lib/docker/volumes/nova_compute/_data/instances

8) Try to live-migrate all VMs back to the failed hypervisor
for i in {1..25}; do openstack --os-compute-api-version 2.30 server migrate \
  --live-migration --block-migration --host comp-01 vm_$i; done
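
Steps 3 and 6 are manual waits. A minimal sketch of a polling helper that could
replace them is below. The function name wait_for_compute_state and its
timeout/interval defaults are made up for this example; it assumes the openstack
CLI is configured with admin credentials and that the State column of the
service list prints "up" or "down".

```shell
# Hypothetical helper: poll until nova-compute on HOST reports the wanted state.
# Usage: wait_for_compute_state HOST up|down [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_compute_state() {
    local host want timeout interval waited state
    host=$1
    want=$2
    timeout=${3:-600}    # assumed default, not taken from the report
    interval=${4:-5}     # assumed default, not taken from the report
    waited=0
    while :; do
        # "-f value -c State" prints only the State column for the matching service.
        state=$(openstack compute service list --service nova-compute \
            --host "$host" -f value -c State)
        if [ "$state" = "$want" ]; then
            return 0
        fi
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $host to become $want (last: $state)" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

For example, "wait_for_compute_state comp-01 down" could stand in for step 3 and
"wait_for_compute_state comp-01 up" for step 6.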

Expected:
No artefacts of the evacuated VMs remain, and the live migrations succeed.

Actual:
Artefacts of the evacuated VMs remain, and the live migrations succeed only on
the second try.
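
The leftover artefacts can be listed with a small script rather than by eye. The
sketch below is only an illustration: the function name leftover_instance_dirs
is invented here, and it assumes that each instance directory is named after the
instance UUID and that a file with the UUIDs currently hosted on the node is
available.

```shell
# Hypothetical helper: print UUID-named directories under INSTANCES_DIR whose
# UUID is not listed (one per line) in HOSTED_FILE.
# Usage: leftover_instance_dirs INSTANCES_DIR HOSTED_FILE
leftover_instance_dirs() {
    local instances_dir hosted_file path name
    instances_dir=$1
    hosted_file=$2
    for path in "$instances_dir"/*/; do
        # If the glob matched nothing, the pattern stays literal; skip it.
        [ -d "$path" ] || continue
        name=$(basename "$path")
        # Keep only UUID-shaped names; other directories are ignored.
        printf '%s\n' "$name" |
            grep -Eq '^[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}$' || continue
        # Report the directory when its UUID is not in the hosted list.
        grep -Fxq "$name" "$hosted_file" || printf '%s\n' "$name"
    done
}
```

The hosted list could be produced with "openstack server list --all-projects
--host comp-01 -f value -c ID > hosted.txt" and copied to the hypervisor, then
checked with "leftover_instance_dirs
/var/lib/docker/volumes/nova_compute/_data/instances hosted.txt". Any UUID it
prints is an artefact of an instance that no longer runs on comp-01.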


Troubleshooting:

1) During initialization, the compute service tries to destroy the artefacts of
evacuated VMs.
The service processes evacuations in every status except failed or completed.
The service checks whether the instance storage is shared.
If it is shared, the instance disk is not destroyed.
To check, the service creates a temp file in the instance folder and asks the
instance's new host (via an RPC request) whether it can see that file. If the
answer is "yes", or if there is no reply, the service assumes the instance
storage is shared.
2) If the evacuation started recently, the RPC request always times out.
The service then does not destroy the instance artefacts, yet it sets the
evacuation status to "completed".
That evacuation is never processed again, so the instance artefacts stay there
forever.
3) New evacuations that start while the evacuation cleanup is running are not
processed either.
4) If an evacuation is going to fail, the service always gets an RPC timeout
and behaves as in item 2.
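
The decision described in items 1 and 2 can be summarised as a small table. The
sketch below is a model of this report's description, not Nova code: the
function name and the yes/no/timeout labels are invented for illustration, and
the "completed" status in the "no" branch is an assumption, since the report
only states it for the timeout case.

```shell
# Model of the cleanup decision as described in this report (not Nova code).
# $1 is the outcome of the "can the new host see my temp file?" RPC check:
#   yes | no | timeout
# Prints "<disk action> <resulting evacuation status>".
evacuation_cleanup_decision() {
    case "$1" in
        yes|timeout)
            # A missing reply is treated like "yes": storage is assumed shared,
            # so the local artefacts are kept and the evacuation is closed.
            echo "keep completed" ;;
        no)
            # Storage is local: the artefacts are destroyed.
            # The "completed" status here is an assumption (see lead-in).
            echo "destroy completed" ;;
        *)
            echo "unknown RPC outcome: $1" >&2
            return 2 ;;
    esac
}
```

Calling "evacuation_cleanup_decision timeout" prints "keep completed", which is
the problematic case: the artefacts stay on disk, and the evacuation is skipped
on every later initialization.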

** Affects: nova
     Importance: Undecided
     Assignee: Mitya Eremeev (mitos)
         Status: In Progress

** Changed in: nova
     Assignee: (unassigned) => Mitya Eremeev (mitos)

** Changed in: nova
       Status: New => In Progress

-- 
You received this bug notification because you are a member of Yahoo!
Engineering Team, which is subscribed to OpenStack Compute (nova).
https://bugs.launchpad.net/bugs/2102034

