On Thu, 2022-02-17 at 14:05 +0100, Lentes, Bernd wrote:
> ----- On Feb 16, 2022, at 6:48 PM, arvidjaar arvidj...@gmail.com wrote:
>
> > Splitting logs between different messages does not really help in
> > interpreting them.
>
> I agree.
> Here is the complete excerpt from the respective time:
> https://nc-mcd.helmholtz-muenchen.de/nextcloud/s/eY8SA8pe4HZBBc8
>
> > I guess the real question here is why "Transition aborted" is
> > logged although transition apparently continues. Transition 128
> > started at 20:54:30 and completed at 21:04:26, but there were
> > multiple "Transition 128 aborted" messages in between
>
> That's correct. The shutdown_timeout for the domain is set with 600
> sec. in the CIB.
> The RA says:
> # The "shutdown_timeout" we use here is the operation
> # timeout specified in the CIB, minus 5 seconds
> And between 20:54:30 and 21:04:26 we have very close 595 sec.
>
> > It looks like "Transition aborted" is more "we try to abort this
> > transition if possible". My guess is that pacemaker must wait for
> > currently running action(s) which can take quite some time when
> > stopping virtual domain. Transition 128 was initiated when stopping
> > vm_pathway, but we have no idea when it was stopped.
>
> We have:
> Feb 15 21:04:26 [15370] ha-idg-2 crmd: notice: run_graph: Transition 128 (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3548.bz2): Complete
>
> and the log from libvirt confirms it:
> /var/log/libvirtd/qemu/vm_pathway.log:
> 2022-02-15T20:04:26.569471Z qemu-system-x86_64: terminating on signal 15 from pid 7368 (/usr/sbin/libvirtd)
> 2022-02-15 20:04:26.769+0000: shutting down, reason=destroyed
>
> Time in libvirt logs is UTC, and in Munich we have currently UTC+1,
> so the time differs in the logs.
> We see that the domain is "switched off" via libvirt exactly at
> 21:04:26.
>
> So for me the big question is:
> When a transition is happening, and there is a change in the cluster,
> is the transition "aborted"
> (delayed or interrupted would be better) or not ?
> Is this behaviour consistent ? If no, from what does it depend ?
>
> Bernd

Yes, anytime the DC sees a change that could affect resources, it will
abort the current transition and calculate a new one. Aborting means
not initiating any new actions from the transition -- but any actions
currently in flight must complete before the new transition can be
calculated.

Changes that abort a transition include configuration changes, a node
joining or leaving, an unexpected action result being received, a node
attribute changing, the cluster-recheck-interval passing since the last
transition, or a timer popping for a time-based event (failure timeout,
rule, etc.). I may be forgetting some, but you get the idea.

-- 
Ken Gaillot <kgail...@redhat.com>
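
To make the abort semantics above concrete, here is a minimal toy model in Python. It is not Pacemaker code and does not use any Pacemaker API; the Action class and both functions are hypothetical names invented for this sketch. It assumes the stop of vm_pathway was initiated when transition 128 started (20:54:30), uses a made-up abort time of 20:57:00 (the thread gives no timestamps for the "Transition 128 aborted" messages), and adds one invented follow-up action purely to show what "not initiating any new actions" means. It also re-checks the timestamp arithmetic from the quoted logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple


@dataclass
class Action:
    """Hypothetical stand-in for one action in a transition graph."""
    name: str
    started: Optional[datetime]   # None: never initiated
    finished: Optional[datetime]  # None: never ran


def abort(actions: List[Action], when: datetime) -> Tuple[List[Action], List[Action]]:
    """Split actions into (in flight at `when`, not yet initiated at `when`)."""
    in_flight = [a for a in actions
                 if a.started is not None and a.started <= when
                 and a.finished is not None and a.finished > when]
    not_started = [a for a in actions
                   if a.started is None or a.started > when]
    return in_flight, not_started


def earliest_recalculation(actions: List[Action], when: datetime) -> datetime:
    """In-flight actions must complete before a new transition is calculated."""
    in_flight, _ = abort(actions, when)
    return max((a.finished for a in in_flight), default=when)


# Local (UTC+1) timestamps taken from the thread.
t_start = datetime(2022, 2, 15, 20, 54, 30)  # transition 128 started
t_done = datetime(2022, 2, 15, 21, 4, 26)    # run_graph logged "Complete"
abort_at = datetime(2022, 2, 15, 20, 57, 0)  # ASSUMPTION: invented abort time

actions = [
    Action("stop vm_pathway", t_start, t_done),
    # ASSUMPTION: not in the logs, added only to illustrate skipping.
    Action("hypothetical follow-up action", None, None),
]

in_flight, skipped = abort(actions, abort_at)
recalculation = earliest_recalculation(actions, abort_at)
duration_s = int((t_done - t_start).total_seconds())

# The libvirt log is in UTC; convert to UTC+1 to compare with the crmd log.
libvirt_utc = datetime(2022, 2, 15, 20, 4, 26, tzinfo=timezone.utc)
libvirt_local = libvirt_utc.astimezone(timezone(timedelta(hours=1)))

print("in flight at abort:", [a.name for a in in_flight])
print("never initiated:   ", [a.name for a in skipped])
print("new transition no earlier than:", recalculation.time())
print("stop ran for", duration_s, "seconds")
print("libvirt shutdown in local time:", libvirt_local.time())
```

In this model the abort at 20:57:00 changes nothing for the running stop action: the next transition cannot be calculated before 21:04:26, and the stop ran for 596 seconds, consistent with the roughly 595 second shutdown_timeout mentioned in the quoted message.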