On 8/9/19 9:06 PM, Yan Gao wrote:
> On 8/9/19 6:40 PM, Andrei Borzenkov wrote:
>> 09.08.2019 16:34, Yan Gao wrote:
>>> Hi,
>>>
>>> With disk-less sbd, it's fine to stop cluster service from the cluster
>>> nodes all at the same time.
>>>
>>> But if to stop the nodes one by one, for example with a 3-node cluster,
>>> after stopping the 2nd node, the only remaining node resets itself with:
>>>
>> That is sort of documented in SBD manual page:
>>
>> --><--
>> However, while the cluster is in such a degraded state, it can
>> neither successfully fence nor be shutdown cleanly (as taking the
>> cluster below the quorum threshold will immediately cause all remaining
>> nodes to self-fence).
>> --><--
>>
>> SBD in shared-nothing mode is basically always in such degraded state
>> and cannot tolerate loss of quorum.
> Well, the context here is it loses quorum *expectedly* since the other
> nodes gracefully shut down.
>
>>
>>
>>> Aug 09 14:30:20 opensuse150-1 sbd[1079]: pcmk: debug:
>>> notify_parent: Not notifying parent: state transient (2)
>>> Aug 09 14:30:20 opensuse150-1 sbd[1080]: cluster: debug:
>>> notify_parent: Notifying parent: healthy
>>> Aug 09 14:30:20 opensuse150-1 sbd[1078]: warning: inquisitor_child:
>>> Latency: No liveness for 4 s exceeds threshold of 3 s (healthy servants: 0)
>>>
>>> I can think of the way to manipulate quorum with last_man_standing and
>>> potentially also auto_tie_breaker, not to mention
>>> last_man_standing_window would also be a factor... But is there a better
>>> solution?
>>>
>> Lack of cluster wide shutdown mode was mentioned more than once on this
>> list. I guess the only workaround is to use higher level tools which
>> basically simply try to stop cluster on all nodes at once. It is still
>> susceptible to race condition.
> Gracefully stopping nodes one by one on purpose is still a reasonable
> need though ...

If you do the teardown as e.g.
pcs is doing it (first tear down the pacemaker instances, then
corosync/sbd), it is at least possible to tear down the pacemaker
instances one by one without risking a reboot due to quorum loss.

With a reasonably current sbd that includes

- https://github.com/ClusterLabs/sbd/commit/824fe834c67fb7bae7feb87607381f9fa8fa2945
- https://github.com/ClusterLabs/sbd/commit/79b778debfee5b4ab2d099b2bfc7385f45597f70
- https://github.com/ClusterLabs/sbd/commit/a716a8ddd3df615009bcff3bd96dd9ae64cb5f68

this should be pretty robust, although we are still thinking about a
cleaner way to detect a graceful pacemaker shutdown (probably together
with some heartbeat to pacemakerd that assures pacemakerd is properly
checking the liveness of its sub-daemons).
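
To illustrate the quorum arithmetic behind the reset discussed in this thread, here is a minimal sketch. It is not taken from sbd or corosync: the function names are hypothetical, and it assumes a plain majority quorum with a fixed expected vote count and one vote per node (no last_man_standing, auto_tie_breaker or similar options).

```python
# Toy model of majority quorum while cluster nodes are stopped one by one.
# Assumption: expected_votes stays fixed and every node has exactly one vote.


def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return expected_votes // 2 + 1


def is_quorate(active_votes: int, expected_votes: int) -> bool:
    """True if the active nodes still hold a majority."""
    return active_votes >= quorum_threshold(expected_votes)


def stop_one_by_one(expected_votes: int) -> list[tuple[int, bool]]:
    """Return (remaining_nodes, quorate) after each single-node stop,
    down to the last remaining node."""
    states = []
    for remaining in range(expected_votes - 1, 0, -1):
        states.append((remaining, is_quorate(remaining, expected_votes)))
    return states


if __name__ == "__main__":
    for remaining, quorate in stop_one_by_one(3):
        status = "quorate" if quorate else "quorum lost"
        print(f"{remaining} node(s) left: {status}")
```

In this model the 3-node cluster stays quorate after the first stop and loses quorum after the second, which is the point where a node running disk-less sbd would reset itself. Stopping only the pacemaker instances first does not change the vote count, which is why that step can be done node by node.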
Klaus

>
> Regards,
>    Yan
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
>
> ClusterLabs home: https://www.clusterlabs.org/