[ClusterLabs] Failed VM (libvirt) stop

Ulrich Windl Fri, 06 Aug 2021 06:24:47 -0700

Hi!

This is not strictly a cluster question, but a resource agent question:
I had a case where a Xen PVM could not be stopped while it was in either GRUB
or the early boot phase.
I noticed that the VM would not stop while I was connected to the (text)
console, so I inspected "xentop":
There the VM was in the "s" (shutdown) state.
The console output was:
---
Loading Linux 5.3.18-59.16-default ...
Loading initial ramdisk ...
[    2.038092] Cannot find an available gap in the 32-bit address range
[    2.038094] PCI devices with unassigned 32-bit BARs may not work!
[    2.490713] reboot: Power down
---


Local log messages were:
Aug 06 08:25:03 h19 VirtualDomain(prm_xen_v01)[10468]: INFO: Issuing graceful shutdown request for domain v01.
Aug 06 08:25:28 h19 kernel: xen-blkback: backend/vbd/25/51744: prepare for reconnect
Aug 06 08:25:28 h19 kernel: xen-blkback: backend/vbd/25/51760: prepare for reconnect
Aug 06 08:30:03 h19 pacemaker-execd[11667]:  warning: prm_xen_v01_stop_0 process (PID 10435) timed out
Aug 06 08:30:03 h19 pacemaker-execd[11667]:  warning: prm_xen_v01_stop_0[10435] timed out after 300000ms
Aug 06 08:30:03 h19 pacemaker-execd[11667]:  notice: prm_xen_v01 stop (call 337, PID 10435) exited with status 1 (execution time 300007ms, queue time 0ms)
Aug 06 08:30:03 h19 pacemaker-controld[11670]:  error: Result of stop operation for prm_xen_v01 on h19: Timed Out
Aug 06 08:30:03 h19 libvirtd[13675]: End of file while reading data: Input/output error

Is that a problem in Xen, libvirt or the RA?
Specifically, I'm missing a forced shutdown (like "m destroy") before the stop
timed out.

The RA doc says: "The default behavior is to resort to a forceful shutdown only 
after a graceful
shutdown attempt has failed."

Browsing the RA, I suspect that when either "virsh shutdown" is waiting for 
completion or VirtualDomain_status is hanging, then the "timeout loop" (after 
which force_stop will be called) does not finish before the cluster times out 
the operation.
The timeout code (shutdown_timeout=$(( $NOW + 
($OCF_RESKEY_CRM_meta_timeout/1000) -5 ))) allows 5 extra seconds from the 
start of the RA (where NOW is set) for all the processing.
So if you spend 2 seconds until the while loop starts, and you spend three more 
extra seconds while waiting for 5 minutes (300s), the cluster will time out the 
stop before the RA makes its final attempt.
That might be a little tight IMHO.
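
To make the suspected failure mode concrete, here is a minimal sketch of a
"graceful first, force after a deadline" loop. It is not the VirtualDomain
code: stop_with_deadline, graceful_stop, is_running and force_stop are
hypothetical names, and the three helpers are stubs standing in for whatever
the RA really calls. It also assumes a "date" that supports +%s.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the VirtualDomain RA.
# $1 = number of seconds allowed for the graceful phase.
stop_with_deadline() {
    deadline=$(( $(date +%s) + $1 ))
    graceful_stop
    # The deadline is only evaluated between status checks. If
    # graceful_stop or is_running blocks, this test is never reached.
    while is_running; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            force_stop
            return $?
        fi
        sleep 1
    done
    return 0
}

# Stubs for illustration only: a domain that never goes away.
graceful_stop() { echo "graceful shutdown requested"; }
is_running()    { return 0; }
force_stop()    { echo "forcing stop"; }

stop_with_deadline 2
```

With these stubs the forced stop is reached after about two seconds. If
is_running were to hang instead of returning, the loop would never get to the
deadline check, which is the behaviour I suspect above.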

In contrast, the older Xen RA uses one third of the timeout as a safety margin:
$((OCF_RESKEY_CRM_meta_timeout/1500))
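
For comparison, the two expressions can be evaluated side by side for the
300000 ms stop timeout from the log above. Only the arithmetic is taken from
the RAs as quoted here; the function names virtualdomain_grace and xen_grace
are hypothetical helpers for illustration.

```shell
#!/bin/sh
# Hypothetical helpers; only the arithmetic mirrors the quoted expressions.

# VirtualDomain style: graceful wait in seconds = timeout(ms)/1000 - 5
virtualdomain_grace() {
    echo $(( $1 / 1000 - 5 ))
}

# Xen RA style: graceful wait in seconds = timeout(ms)/1500,
# which is two thirds of the timeout expressed in seconds
xen_grace() {
    echo $(( $1 / 1500 ))
}

timeout_ms=300000
total=$(( timeout_ms / 1000 ))
vd=$(virtualdomain_grace "$timeout_ms")
xen=$(xen_grace "$timeout_ms")

echo "VirtualDomain: graceful wait ${vd}s, margin $(( total - vd ))s"
echo "Xen RA:        graceful wait ${xen}s, margin $(( total - xen ))s"
```

For a 300 s timeout this gives a graceful wait of 295 s with a 5 s margin for
the VirtualDomain expression, versus 200 s with a 100 s margin for the Xen RA
expression.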

Any splendid insights?

Regards,
Ulrich



_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
