[Linux-cluster] Rejoin cluster after failure without reboot?

Jonathan Davies Wed, 25 Nov 2015 07:34:38 -0800

Hi,

I'm experimenting with corosync+dlm+gfs2 (approximately following http://people.redhat.com/teigland/cluster4-gfs2-dlm.txt) and am trying to establish whether it meets my requirements. I have a query about a node rejoining a cluster after failure, and want to make sure I'm not overlooking something.

I have a three-node cluster and deliberately cause token loss by firewalling one of them (call it node A) out of the network for longer than the token timeout. At this point, the other two hosts (B and C) decide that A has disappeared and continue with quorum. That is fine.

When I unfirewall node A, dlm tries to reconnect to its peers on B and C. But then I see the following on host B:


16:29:25.823496 nodeb dlm_controld[6548]: 908 daemon node 85 stateful merge
16:29:25.823529 nodeb dlm_controld[6548]: 908 daemon node 85 kill due to stateful merge
16:29:25.823543 nodeb dlm_controld[6548]: 908 tell corosync to remove nodeid 85 from cluster
16:29:25.823696 nodeb corosync[6536]: [CFG ] request to kill node 85(us=83): xxx


and then the following on node A:

16:29:25.828547 nodea corosync[3896]: [CFG ] Killed by node 83: dlm_controld
16:29:25.828575 nodea corosync[3896]: [MAIN ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
16:29:25.834828 nodea dlm_controld[3466]: 1183 process_cluster_cfg cfg_dispatch 2
16:29:25.834871 nodea dlm_controld[3466]: 1183 cluster is down, exiting
16:29:25.834886 nodea dlm_controld[3466]: 1183 process_cluster quorum_dispatch 2
16:29:25.834903 nodea dlm_controld[3466]: 1183 daemon cpg_dispatch error 2
16:29:25.834917 nodea dlm_controld[3466]: 1183 cpg_dispatch error 2
16:29:25.837152 nodea dlm_controld[3466]: 1183 abandoned lockspace mygfs2

resulting in both corosync and dlm_controld exiting on node A.

Later, if I try to manually restart corosync and dlm on node A, I see the following:


16:32:08.382871 nodea dlm_controld[20483]: 2872 dlm_controld 4.0.2 started
16:32:08.392453 nodea dlm_controld[20483]: 2872 found uncontrolled lockspace mygfs2
16:32:08.392477 nodea dlm_controld[20483]: 2872 tell corosync to remove nodeid 85 from cluster
16:32:08.394965 nodea corosync[20456]: [CFG ] request to kill node 85(us=85): xxx
16:32:08.394998 nodea corosync[20456]: [CFG ] Killed by node 85: dlm_controld


The only way of making A rejoin the cluster is to reboot.

I would be grateful if you could confirm the following statements:

  (a) The "stateful merge" is unavoidable when node A leaves the cluster for longer than the token timeout then tries to rejoin.

  (b) Killing corosync on node A is unavoidable when node B sees the "stateful merge".

  (c) dlm exiting is unavoidable when corosync dies.

  (d) Restarting corosync then dlm on node A will necessarily result in "found uncontrolled lockspace".

  (e) The only way to recover from "found uncontrolled lockspace" (for a gfs2 lockspace) is to reboot.

I'm hoping that I'm overlooking something and that at least one of (a)--(e) is false! I'm not comfortable with a reboot being the only means of recovery when the token timeout is exceeded.


Thanks,
Jonathan

--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
