Hi!

This is not strictly a cluster question but a resource agent question: I had a case where a Xen PVM could not be stopped while it was in either GRUB or the early boot phase. I noticed that the VM would not stop while I was connected to the (text) console, so I inspected "xentop": there the VM was in "s" state (shutdown). The console output was:

---
Loading Linux 5.3.18-59.16-default ...
Loading initial ramdisk ...
[ 2.038092] Cannot find an available gap in the 32-bit address range
[ 2.038094] PCI devices with unassigned 32-bit BARs may not work!
[ 2.490713] reboot: Power down
---

Local log messages were:

Aug 06 08:25:03 h19 VirtualDomain(prm_xen_v01)[10468]: INFO: Issuing graceful shutdown request for domain v01.
Aug 06 08:25:28 h19 kernel: xen-blkback: backend/vbd/25/51744: prepare for reconnect
Aug 06 08:25:28 h19 kernel: xen-blkback: backend/vbd/25/51760: prepare for reconnect
Aug 06 08:30:03 h19 pacemaker-execd[11667]: warning: prm_xen_v01_stop_0 process (PID 10435) timed out
Aug 06 08:30:03 h19 pacemaker-execd[11667]: warning: prm_xen_v01_stop_0[10435] timed out after 300000ms
Aug 06 08:30:03 h19 pacemaker-execd[11667]: notice: prm_xen_v01 stop (call 337, PID 10435) exited with status 1 (execution time 300007ms, queue time 0ms)
Aug 06 08:30:03 h19 pacemaker-controld[11670]: error: Result of stop operation for prm_xen_v01 on h19: Timed Out
Aug 06 08:30:03 h19 libvirtd[13675]: End of file while reading data: Input/output error

Is that a problem in Xen, libvirt or the RA? Specifically, I'm missing a forced shutdown (like "m destroy") before the stop timed out. The RA doc says: "The default behavior is to resort to a forceful shutdown only after a graceful shutdown attempt has failed."

Browsing the RA, I suspect that when either "virsh shutdown" is waiting for completion or VirtualDomain_status is hanging, the "timeout loop" (after which force_stop will be called) does not finish before the cluster times out the operation.

The timeout code is:

    shutdown_timeout=$(( $NOW + ($OCF_RESKEY_CRM_meta_timeout/1000) -5 ))

It leaves only 5 seconds, counted from the start of the RA (where NOW is set), for all other processing. So with a 5-minute (300 s) stop timeout, if 2 seconds pass before the while loop starts and three more seconds of overhead accumulate while waiting, the cluster will time out the stop before the RA makes its final attempt. That might be a little tight IMHO.

In contrast, the older Xen RA uses 1/3rd of the timeout as a safety margin (the timeout is given in milliseconds, so dividing by 1500 yields two thirds of it in seconds):

    $((OCF_RESKEY_CRM_meta_timeout/1500))

Any splendid insights?
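
To make the difference concrete, here is a small sketch that evaluates both expressions quoted above for a given stop timeout. The function names are made up for illustration and are not part of either RA; only the two arithmetic expressions come from the agents. I am also assuming a fallback of 300000 ms (the value from my logs) when OCF_RESKEY_CRM_meta_timeout is not set.

```shell
#!/bin/sh
# Illustration only: the helper names below are hypothetical, not RA functions.

# Seconds allowed for the graceful shutdown, VirtualDomain style:
# the timeout (ms) converted to seconds, minus a fixed 5 seconds.
virtualdomain_grace_seconds() {
    echo $(( $1 / 1000 - 5 ))
}

# Seconds allowed for the graceful shutdown, older Xen RA style:
# two thirds of the timeout, leaving one third as margin.
xen_grace_seconds() {
    echo $(( $1 / 1500 ))
}

# Assumption: fall back to 300000 ms when not run by the cluster.
timeout_ms=${OCF_RESKEY_CRM_meta_timeout:-300000}
total=$(( timeout_ms / 1000 ))

vd=$(virtualdomain_grace_seconds "$timeout_ms")
xen=$(xen_grace_seconds "$timeout_ms")

echo "VirtualDomain: graceful ${vd}s, margin $(( total - vd ))s"
echo "Xen:           graceful ${xen}s, margin $(( total - xen ))s"
```

With a 300000 ms timeout this gives 295 s graceful / 5 s margin for VirtualDomain versus 200 s graceful / 100 s margin for the Xen RA, so any overhead beyond 5 seconds eats the forced stop in the first case.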

Regards,
Ulrich