On Tue, 2020-08-25 at 12:28 +0530, Rohit Saini wrote:
> Hi All,
> I am seeing the following behavior. Can someone clarify if this is
> intended behavior. If yes, then why so? Please let me know if logs
> are needed for better clarity.
>
> 1. Without Stonith:
> Continuous corosync kill on master causes switchover and makes
> another node as master. But as soon as this corosync recovers, it
> becomes master again. Shouldn't it become slave now?
Where resources are active, or which node takes on the master role, depends on the cluster configuration, not on past node issues. You may be interested in the resource-stickiness property:

https://clusterlabs.org/pacemaker/doc/en-US/Pacemaker/2.0/html-single/Pacemaker_Explained/index.html#_resource_meta_attributes

> 2. With Stonith:
> Sometimes, on corosync kill, that node gets shooted by stonith but
> sometimes not. Not able to understand this fluctuating behavior. Does
> it have to do anything with faster recovery of corosync, which
> stonith fails to detect?

It's not that the cluster fails to detect it; the node recovers satisfactorily without needing to be fenced.

At any given time, one of the cluster nodes is elected the designated controller (DC). When new events occur, such as a node leaving the corosync ring unexpectedly, the DC runs pacemaker's scheduler to determine what needs to be done about it. In the case of a lost node, the DC also erases that node's resource history, to indicate that the state of resources on the node is no longer accurately known.

If no further events happened in the meantime, the scheduler would schedule fencing, and the cluster would carry it out. However, systemd monitors corosync and will restart it if it dies. If systemd respawns corosync fast enough (often sub-second), the node rejoins the cluster before the scheduler completes its calculations and fencing is initiated. Rejoining the cluster includes re-syncing the node's resource history with the other nodes.

The node join is considered new information, so the previous scheduler run is cancelled (the "transition" is "aborted") and a new one is started. Since the node is now happily part of the cluster, and its resource history tells us the state of all resources on it, no fencing is needed.
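To make the fencing behavior above concrete, here is a toy sketch (illustrative only, not pacemaker code) of why a fast corosync respawn cancels fencing: any new cluster event aborts the in-progress transition, and the scheduler's next run only sees the final membership state.

```python
def schedule(events):
    """Toy model of the DC's scheduler reacting to membership events.

    Each event is either 'node_lost' or 'node_rejoined'. A new event
    aborts whatever transition the previous one started, so only the
    state after the last event determines the action taken.
    """
    action = "none"
    for ev in events:
        if ev == "node_lost":
            # Scheduler proposes fencing the lost node.
            action = "fence"
        elif ev == "node_rejoined":
            # Transition aborted; the recomputation sees a healthy
            # node with a known resource history, so nothing to do.
            action = "none"
    return action

# Slow (or no) respawn: fencing goes ahead.
print(schedule(["node_lost"]))                   # -> fence
# Sub-second respawn: the rejoin aborts the fencing transition.
print(schedule(["node_lost", "node_rejoined"]))  # -> none
```

In the real cluster the race is between systemd restarting corosync and the DC completing its transition, which is why the behavior looks random from the outside.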
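For the resource-stickiness property mentioned in the first answer, a rough sketch of how it could be set with pcs on a Pacemaker 1.1 / CentOS 7 cluster (the score of 100 and the resource name "my-ms" are illustrative, not from the original thread):

```shell
# Cluster-wide default: resources prefer to stay where they are
# unless a score greater than 100 pulls them elsewhere.
pcs resource defaults resource-stickiness=100

# Or per-resource, for a hypothetical master/slave resource "my-ms":
pcs resource meta my-ms resource-stickiness=100
```

With a nonzero stickiness, a recovered node will not automatically take the master role back unless other scores (e.g. location constraints) outweigh it.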
> I am using
> corosync-2.4.5-4.el7.x86_64
> pacemaker-1.1.19-8.el7.x86_64
> centos 7.6.1810
>
> Thanks,
> Rohit
--
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/