Sherrard Burton wrote:


On 10/24/19 1:30 PM, Andrei Borzenkov wrote:
24.10.2019 16:54, Sherrard Burton wrote:
background:
we are upgrading a (very) old HA cluster running heartbeat, DRBD and NFS,
with no stonith, to a much more modern implementation. for the existing
cluster, as well as the new one, the disk space requirements make
running a full three-node cluster infeasible, so i am trying to
configure a quorum-only node using corosync-qnetd.

the installation went fine, the nodes can communicate, etc, and the
cluster seems to perform as desired when gracefully shutting down or
restarting a node. but during my torture testing, simulating a node
crash by stopping the network on one node leaves the remaining node in
limbo for approximately 20 seconds before it and the quorum-only node
decide that they are indeed quorate.
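
One way to put a number on that limbo is to poll the quorum status on the surviving node and time how long it takes to report quorate again. The following is only a rough sketch, not part of the original setup: it assumes `corosync-quorumtool -s` is on the PATH and prints a `Quorate: Yes/No` line, and the helper names (`is_quorate`, `seconds_until_quorate`, `read_quorumtool`) are made up for illustration.

```python
import re
import subprocess
import time


def is_quorate(status_text: str) -> bool:
    """Return True if the status text contains a 'Quorate: Yes' line."""
    match = re.search(r"^Quorate:\s*(\w+)", status_text, re.MULTILINE)
    return bool(match) and match.group(1).lower() == "yes"


def seconds_until_quorate(read_status, poll_interval=0.5, limit=60.0):
    """Poll read_status() until it reports quorum.

    Returns the elapsed seconds, or None if `limit` seconds pass first.
    """
    start = time.monotonic()
    while time.monotonic() - start < limit:
        if is_quorate(read_status()):
            return time.monotonic() - start
        time.sleep(poll_interval)
    return None


def read_quorumtool() -> str:
    # Assumption: corosync-quorumtool is installed on this node.
    # Its exit status is ignored on purpose; only the printed status is parsed.
    result = subprocess.run(
        ["corosync-quorumtool", "-s"], capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # Start this on the surviving node right after cutting the peer's network.
    elapsed = seconds_until_quorate(read_quorumtool)
    if elapsed is None:
        print("still inquorate after the time limit")
    else:
        print(f"quorate after {elapsed:.1f} s")
```

The measured value is only as precise as `poll_interval`, and it includes the time from starting the script to the first poll, so it is meant for comparing timeout settings against each other, not as an exact figure.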

the problem:
the intended implementation involves DRBD, and its resource-level
fencing freezes IO during the time that the remaining node is inquorate
in order to avoid any possible data divergence/split-brain. this
precaution is obviously desirable, and is the reason that i am trying to
configure this cluster "properly".

my (admittedly naive) expectation is that the remaining node and the
quorum-only node would continue ticking along as if nothing happened,
and i am hoping that this delay is due to some
misconfiguration/oversight/bone-headedness on my part.

so i am seeking understanding on the reason for this delay, and whether
there is any (prudent) way to reduce it. of course, any other advice on
the intended setup is welcome as well.

please let me know if you require any additional details.



You may be interested in this discussion

https://www.mail-archive.com/users@clusterlabs.org/msg08907.html

thanks Andrei.

my searches have brought me to that thread a few times, but i did not think it applied because it seemed as if the asker was having issues with complete loss of quorum and some unwanted fencing that resulted from that, based on the relative values of some of these timeouts.

after re-reading it, i can see how it relates to my issue. but given the number of iterations of suggestion/question -> misunderstanding -> correction/clarification, i was unable to distill from that discussion which settings should and shouldn't be touched, and which ones will positively affect my situation while avoiding negative implications.

was there ever a "final verdict" from that discussion which would allow me to reduce the delay in determining quorum after partition without also ending up in the same situation as the asker, in which conflicting timeout values introduce a different problem?

Hi,
distillation

https://github.com/ClusterLabs/sbd/pull/76#issuecomment-486952369

This should reduce the time of corosync "limbo" to ~2 sec.

Regards,
  Honza



_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
