Re: [ClusterLabs] reducing corosync-qnetd "response time"

Jan Friesse Fri, 25 Oct 2019 00:17:57 -0700

Sherrard Burton wrote:

On 10/24/19 1:30 PM, Andrei Borzenkov wrote:
On 24.10.2019 16:54, Sherrard Burton wrote:
background:
we are upgrading a (very) old HA cluster running heartbeat, DRBD and NFS,
with no stonith, to a much more modern implementation. for the existing
cluster, as well as the new one, the disk space requirements make
running a full three-node cluster infeasible, so i am trying to
configure a quorum-only node using corosync-qnetd.

the installation went fine, the nodes can communicate, etc, and the
cluster seems to perform as desired when gracefully shutting down or
restarting a node. but during my torture testing, simulating a node
crash by stopping the network on one node leaves the remaining node in
limbo for approximately 20 seconds before it and the quorum-only node
decide that they are indeed quorate.

the problem:
the intended implementation involves DRBD, and its resource-level
fencing freezes IO during the time that the remaining node is inquorate
in order to avoid any possible data divergence/split-brain. this
precaution is obviously desirable, and is the reason that i am trying to
configure this cluster "properly".

my (admittedly naive) expectation is that the remaining node and the
quorum-only node would continue ticking along as if nothing happened,
and i am hoping that this delay is due to some
misconfiguration/oversight/bone-headedness on my part.

so i am seeking understanding on the reason for this delay, and whether
there is any (prudent) way to reduce it. of course, any other advice on
the intended setup is welcome as well.

please let me know if you require any additional details.
You may be interested in this discussion

https://www.mail-archive.com/users@clusterlabs.org/msg08907.html
thanks Andrei.
my searches have brought me to that thread a few times, but i did not think it applied because it seemed as if the asker was having issues with complete loss of quorum and some unwanted fencing that resulted from that, based on the relative values of some of these timeouts.
after re-reading it, i can see how it relates to my issue. but given the number of iterations of suggestion/question -> misunderstanding -> correction/clarification, i was unable to distill from that discussion which settings should and shouldn't be touched, and which ones will positively affect my situation while avoiding negative implications.
was there ever a "final verdict" from that discussion which would allow me to reduce the delay in determining quorum after partition without also ending up in the same situation as the asker, in which conflicting timeout values introduce a different problem?


Hi,
a distillation of that discussion:

https://github.com/ClusterLabs/sbd/pull/76#issuecomment-486952369

This should reduce the time of corosync "limbo" to ~2 sec.
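
For orientation, the quorum device timers are set in the quorum.device section of corosync.conf. The fragment below is only a sketch: the host name qnetd.example.com is hypothetical, and timeout and sync_timeout are shown at what I understand to be their documented defaults, not at the values from the linked comment. Take the actual values from that comment.

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        votes: 1
        # assumed defaults, in milliseconds; not the recommended values
        timeout: 10000
        sync_timeout: 30000
        net {
            # hypothetical address of the quorum-only (qnetd) node
            host: qnetd.example.com
            algorithm: ffsplit
            tls: on
        }
    }
}
```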

Regards,
  Honza

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

