Andrei and Klaus, thanks for the prompt reply and clarification! As I understand it, the design and behavior of Pacemaker are tightly coupled with the stonith concept. But isn't that too rigid?
Is there a way to leverage self-monitoring or pingd rules to trigger an isolated node to unmount its FS, like the vSphere High Availability host isolation response? Can resource-stickiness=off (auto-failback) decrease the risk of corruption when an unresponsive node comes back online? Is there a quorum feature not for the cluster but for resource start/stop? Got the lock - welcome to mount; unable to refresh the lease - force unmount. Can on-fail=ignore break manual failover logic (will a stopped resource be considered failed and thus ignored)?

best regards,
Artem

On Tue, 19 Dec 2023 at 17:03, Klaus Wenninger <kwenn...@redhat.com> wrote:
>
> On Tue, Dec 19, 2023 at 10:00 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
>
>> On Tue, Dec 19, 2023 at 10:41 AM Artem <tyom...@gmail.com> wrote:
>> ...
>> > Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
>> > Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
>> > Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4 ) blocked
>>
>> This is the default for a failed stop operation. The only way Pacemaker
>> can resolve a failure to stop a resource is to fence the node where the
>> resource was active. If that is not possible (and IIRC you refuse to use
>> stonith), Pacemaker has no choice but to block it. If you insist, you
>> can of course set on-fail=ignore, but this means an unreachable node
>> will continue to run resources. Whether that can lead to corruption in
>> your case I cannot guess.
>
> I don't know if I'm reading that correctly, but I understand from what
> you wrote above that you try to trigger the failover by stopping the VM
> (lustre4) without an ordered shutdown.
> With fencing disabled, what we are seeing is exactly what we would
> expect: the state of the resource is unknown - Pacemaker tries to stop
> it - that doesn't work as the node is offline - no fencing is configured -
> so all it can do is wait until there is info on whether the resource is
> up or not.
> I guess the strange output below is because fencing is disabled - quite
> an unusual - and not recommended - configuration, so this might not have
> shown up too often in that way.
>
> Klaus
>
>> > Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (pcmk__create_graph) crit: Cannot fence lustre4 because of OST4: blocked (OST4_stop_0)
>>
>> That is a rather strange phrase. The resource is blocked because
>> Pacemaker could not fence the node, not the other way round.
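For anyone following along, this is roughly how the on-fail=ignore setting Andrei mentions would look with pcs. The resource name OST4 is taken from the logs above; the interval and timeout values are illustrative assumptions, and as noted in the thread this is not a recommendation - it means an unreachable node may keep the resource running:

```shell
# Sketch only: have Pacemaker ignore a failed stop of OST4 instead of
# blocking the resource (or fencing the node). Values are illustrative.
pcs resource update OST4 op stop interval=0s timeout=300s on-fail=ignore

# Inspect the resource afterwards to confirm the operation settings
pcs resource config OST4
```

This only changes how a stop failure is reported, not the underlying problem: without stonith there is still no way for the cluster to know the state of the node that stopped responding.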
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/