>>> "Lentes, Bernd" <bernd.len...@helmholtz-muenchen.de> wrote on 23.10.2020 at 23:16 in message <1814448122.1773393.1603487817751.javamail.zim...@helmholtz-muenchen.de>:
>
> ----- On Oct 23, 2020, at 8:45 PM, Valentin Vidić vvi...@valentin-vidic.from.hr wrote:
>
>> On Fri, Oct 23, 2020 at 08:08:31PM +0200, Lentes, Bernd wrote:
>>> But when the timeout has run out the RA tries to kill the machine with a "virsh destroy".
>>> And if that does not work (which is occasionally my problem) because the domain
>>> is in uninterruptible sleep (D state), the RA gives a $OCF_ERR_GENERIC back, which
>>> causes pacemaker to fence the lazy node. Or am I wrong?
>>
>> What does the log look like when this happens?
>>
>
> /var/log/cluster/corosync.log:
>
> VirtualDomain(vm_amok)[8998]: 2020/09/27_22:34:11 INFO: Issuing graceful shutdown request for domain vm_amok.
>
> VirtualDomain(vm_amok)[8998]: 2020/09/27_22:37:06 INFO: Issuing forced shutdown (destroy) request for domain vm_amok.
> Sep 27 22:37:11 [11282] ha-idg-2 lrmd: warning: child_timeout_callback: vm_amok_stop_0 process (PID 8998) timed out
> Sep 27 22:37:11 [11282] ha-idg-2 lrmd: warning: operation_finished: vm_amok_stop_0:8998 - timed out after 180000ms
> The timeout of the domain is 180 sec.
>
> /var/log/libvirt/libvirtd.log (time is UTC):
>
> 2020-09-27 20:37:21.489+0000: 18583: error : virProcessKillPainfully:401 : Failed to terminate process 14037 with SIGKILL: Device or resource busy

"SIGKILL: Device or resource busy" is nonsense: kill does not wait; it either fails or succeeds.

> 2020-09-27 20:37:21.505+0000: 6610: error : virNetSocketWriteWire:1852 : Cannot write data: Broken pipe
> 2020-09-27 20:37:31.962+0000: 6610: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor
>
> SIGKILL didn't work. Nevertheless the process is finished 20 seconds later
> after destroy, surely because it woke up from D and received the signal.
>
> /var/log/cluster/corosync.log on the DC:
>
> Sep 27 22:37:11 [3580] ha-idg-1 crmd: warning: status_from_rc: Action 93 (vm_amok_stop_0) on ha-idg-2 failed (target: 0 vs. rc: 1): Error
> Stop (also SIGKILL) failed.
> Sep 27 22:37:11 [3579] ha-idg-1 pengine: notice: native_stop_constraints: Stop of failed resource vm_amok is implicit after ha-idg-2 is fenced
> The cluster decides to fence the node although the resource is stopped 10 seconds later.
>
> atop log:
> 14037 - S 261% /usr/bin/qemu-system-x86_64 -machine accel=kvm -name guest=vm_amok,debug-threads=on -S -object secret,id=masterKey0 ...
> PID of the domain is 14037
>
> 14037 - E 0% worker (at 22:37:31)
> The domain has stopped.
>
>
> Bernd
> Helmholtz Zentrum München
>
> Helmholtz Zentrum Muenchen
> Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH)
> Ingolstaedter Landstr. 1
> 85764 Neuherberg
> www.helmholtz-muenchen.de
> Aufsichtsratsvorsitzende: MinDir.in Prof. Dr. Veronika von Messling
> Geschaeftsfuehrung: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Kerstin Guenther
> Registergericht: Amtsgericht Muenchen HRB 6466
> USt-IdNr: DE 129521671
>
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/
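
To illustrate the remark that kill does not wait: sending SIGKILL returns immediately, and any waiting for the process to disappear has to be done separately by the caller. The following is only a minimal Python sketch under the assumption of a POSIX host; kill_and_wait and demo are hypothetical names for illustration and are not taken from libvirt or the VirtualDomain RA. On that reading, the message in libvirtd.log would reflect a wait that gave up, not the kill call itself (this is an interpretation, not checked against the libvirt source).

```python
import signal
import subprocess
import sys
import time


def kill_and_wait(proc, timeout_s):
    """Send SIGKILL, then wait up to timeout_s seconds for the process to exit.

    Returns (seconds spent sending the signal, whether the process exited).
    Hypothetical helper for illustration only.
    """
    start = time.monotonic()
    proc.send_signal(signal.SIGKILL)  # returns at once; does not wait for the exit
    kill_elapsed = time.monotonic() - start
    try:
        proc.wait(timeout=timeout_s)  # the waiting is a separate step
        exited = True
    except subprocess.TimeoutExpired:
        # A process blocked in uninterruptible sleep (D state) would end up
        # here: the signal stays pending until the process leaves that state.
        exited = False
    return kill_elapsed, exited


def demo():
    # Stand-in for a long-running process: a child that just sleeps.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    kill_elapsed, exited = kill_and_wait(child, timeout_s=5.0)
    return kill_elapsed, exited, child.returncode


if __name__ == "__main__":
    print(demo())
```

With an ordinary sleeping child the wait succeeds almost immediately. For a process in D state the same wait would run into the timeout, and the caller then has to decide whether to report an error or keep polling.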