Re: [ClusterLabs] cluster doesn't do HA as expected, pingd doesn't help

Vladislav Bogdanov Tue, 19 Dec 2023 10:51:55 -0800

What if a node (especially a VM) freezes for several minutes and then continues to write to a shared disk where other nodes have already put their data?
In my opinion, fencing, preferably two-level, is mandatory for Lustre. Trust me, I developed the whole HA stack for both Exascaler and PangeaFS. We've seen so many points where data loss may occur...
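
A minimal sketch of the freeze scenario, assuming a hypothetical lease-based self-unmount scheme of the kind asked about below. All names and timings here are illustrative and not taken from any real cluster. It shows that a node which checked its lease just before freezing will still complete its write after another node has taken over:

```python
def lease_valid(lease_expiry: int, now: int) -> bool:
    """A lease is usable only strictly before its expiry time."""
    return now < lease_expiry


def simulate(freeze_seconds: int) -> list:
    """Return the ordered list of (time, node, data) writes to the shared disk.

    Hypothetical model: node A holds a lease until t=30 and checks it at t=10,
    then freezes for `freeze_seconds` before the write it already decided on.
    """
    disk_log = []
    lease_expiry = 30
    now = 10

    # Node A checks its lease right before writing: still valid.
    assert lease_valid(lease_expiry, now)

    # The VM freezes between the check and the actual write.
    now += freeze_seconds

    # If the lease ran out in the meantime, node B takes over and writes.
    if not lease_valid(lease_expiry, now):
        disk_log.append((lease_expiry, "B", "B's write after takeover"))

    # Node A resumes and finishes its write; it never re-checked the lease.
    disk_log.append((now, "A", "A's write"))
    return disk_log


if __name__ == "__main__":
    for entry in simulate(300):
        print(entry)
```

With a 300-second freeze both nodes end up writing to the same disk, which is the case only external fencing can rule out, because the frozen node cannot be relied on to stop itself.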


On December 19, 2023 19:42:56 Artem <tyom...@gmail.com> wrote:

Andrei and Klaus, thanks for the prompt reply and clarification!
As I understand it, the design and behavior of Pacemaker are tightly coupled with the stonith concept. But isn't it too rigid?
Is there a way to leverage self-monitoring or pingd rules to trigger an isolated node to umount its FS? Like the vSphere High Availability host isolation response.
Can resource-stickiness=off (auto-failback) decrease the risk of corruption by an unresponsive node coming back online?
Is there a quorum feature not for the cluster but for resource start/stop? Got the lock - welcome to mount; unable to refresh the lease - force unmount.
Can on-fail=ignore break manual failover logic (stopped will be considered as failed and thus ignored)?
best regards,
Artem

On Tue, 19 Dec 2023 at 17:03, Klaus Wenninger <kwenn...@redhat.com> wrote:


On Tue, Dec 19, 2023 at 10:00 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
On Tue, Dec 19, 2023 at 10:41 AM Artem <tyom...@gmail.com> wrote:
...
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (update_resource_action_runnable) warning: OST4_stop_0 on lustre4 is unrunnable (node is offline)
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (recurring_op_for_active) info: Start 20s-interval monitor for OST4 on lustre3
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107] (log_list_item) notice: Actions: Stop OST4 ( lustre4) blocked
This is the default for the failed stop operation. The only way
pacemaker can resolve failure to stop a resource is to fence the node
where this resource was active. If it is not possible (and IIRC you
refuse to use stonith), pacemaker has no other choice but to block it.
If you insist, you can of course set on-fail=ignore, but this means
the unreachable node will continue to run resources. Whether that can
lead to corruption in your case I cannot guess.
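
The choices described above can be summarized in a small sketch. This is a toy model for illustration only, not Pacemaker's scheduler code; the function and enum names are hypothetical. It assumes the documented default for a stop operation's on-fail: "fence" when stonith is enabled, "block" otherwise.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    FENCE = "fence the node, then recover the resource elsewhere"
    BLOCK = "leave the resource blocked until its state is known"
    IGNORE = "treat the stop as if it had succeeded"


def default_stop_on_fail(stonith_enabled: bool) -> str:
    """Assumed default on-fail for a stop operation."""
    return "fence" if stonith_enabled else "block"


def failed_stop_outcome(stonith_enabled: bool,
                        on_fail: Optional[str] = None) -> Outcome:
    """Toy decision for a stop that failed or could not be run."""
    policy = on_fail if on_fail is not None else default_stop_on_fail(stonith_enabled)
    if policy == "ignore":
        # The unreachable node may still be running the resource.
        return Outcome.IGNORE
    if policy == "fence" and stonith_enabled:
        return Outcome.FENCE
    # Without usable fencing there is no safe way to confirm the stop.
    return Outcome.BLOCK


if __name__ == "__main__":
    print(failed_stop_outcome(stonith_enabled=False))
    print(failed_stop_outcome(stonith_enabled=True))
    print(failed_stop_outcome(stonith_enabled=False, on_fail="ignore"))
```

The first call corresponds to the situation in the log above: with stonith disabled, the stop of OST4 on the offline node ends up blocked.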

I don't know if I'm reading that correctly, but from what you wrote above
I understand that you are trying to trigger the failover by stopping the VM
(lustre4) without an ordered shutdown.
With fencing disabled, what we are seeing is exactly what we would expect:
the state of the resource is unknown - pacemaker tries to stop it - that
doesn't work as the node is offline - no fencing is configured - so all it
can do is wait until there is info on whether the resource is up or not.
I guess the strange output below is because fencing is disabled - quite an
unusual, and also not recommended, configuration - so this might not have
shown up too often in that way.

Klaus
Dec 19 09:48:13 lustre-mds2.ntslab.ru pacemaker-schedulerd[785107](pcmk__create_graph) crit: Cannot fence lustre4 because of OST4:blocked (OST4_stop_0)
That is a rather strange phrase. The resource is blocked because
pacemaker could not fence the node, not the other way round.
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/