I just tried removing all the quorum options, setting everything back to defaults, so no
last_man_standing or wait_for_all. I still see the same behaviour: the third node is
fenced if I bring down services on two nodes.

Thanks
David
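P.S. For reference, this is roughly the sequence I'm following for each test. The node
names are just examples, and corosync-quorumtool is only there to cross-check what pcs
reports:

    # starting state, run on node1
    pcs quorum config
    corosync-quorumtool -s

    # stop cluster services on the first node, then wait a few minutes
    pcs cluster stop node2
    corosync-quorumtool -s

    # stop the second node and watch whether the survivor keeps quorum
    # and the resources, or gets fenced
    pcs cluster stop node3
    pcs status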
On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger <kwenn...@redhat.com> wrote:
>
> On Thu, Aug 31, 2023 at 12:28 PM David Dolan <daithido...@gmail.com> wrote:
>
>> On Wed, 30 Aug 2023 at 17:35, David Dolan <daithido...@gmail.com> wrote:
>>
>>>> > Hi All,
>>>> >
>>>> > I'm running Pacemaker on Centos7
>>>> > Name        : pcs
>>>> > Version     : 0.9.169
>>>> > Release     : 3.el7.centos.3
>>>> > Architecture: x86_64
>>>> >
>>>> Besides the pcs-version versions of the other cluster-stack-components
>>>> could be interesting. (pacemaker, corosync)
>>>>
>>> rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>>> corosynclib-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>>> corosync-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>>> pacemaker-1.1.23-1.el7_9.1.x86_64
>>> pcs-0.9.169-3.el7.centos.3.x86_64
>>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>>
>>>> > I'm performing some cluster failover tests in a 3 node cluster. We have 3
>>>> > resources in the cluster.
>>>> > I was trying to see if I could get it working if 2 nodes fail at different
>>>> > times. I'd like the 3 resources to then run on one node.
>>>> >
>>>> > The quorum options I've configured are as follows
>>>> > [root@node1 ~]# pcs quorum config
>>>> > Options:
>>>> >   auto_tie_breaker: 1
>>>> >   last_man_standing: 1
>>>> >   last_man_standing_window: 10000
>>>> >   wait_for_all: 1
>>>> >
>>>> Not sure if the combination of auto_tie_breaker and last_man_standing
>>>> makes sense.
>>>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>>>> should be disabled anyway I guess.
>>>>
>>> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>>>
>>>> > [root@node1 ~]# pcs quorum status
>>>> > Quorum information
>>>> > ------------------
>>>> > Date:             Wed Aug 30 11:20:04 2023
>>>> > Quorum provider:  corosync_votequorum
>>>> > Nodes:            3
>>>> > Node ID:          1
>>>> > Ring ID:          1/1538
>>>> > Quorate:          Yes
>>>> >
>>>> > Votequorum information
>>>> > ----------------------
>>>> > Expected votes:   3
>>>> > Highest expected: 3
>>>> > Total votes:      3
>>>> > Quorum:           2
>>>> > Flags:            Quorate WaitForAll LastManStanding AutoTieBreaker
>>>> >
>>>> > Membership information
>>>> > ----------------------
>>>> >     Nodeid      Votes    Qdevice Name
>>>> >          1          1         NR node1 (local)
>>>> >          2          1         NR node2
>>>> >          3          1         NR node3
>>>> >
>>>> > If I stop the cluster services on node 2 and 3, the groups all failover to
>>>> > node 1 since it is the node with the lowest ID
>>>> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
>>>> > fails.
>>>> >
>>>> > I tried adding this line to corosync.conf and I could then bring down the
>>>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until last,
>>>> > the cluster failed
>>>> > auto_tie_breaker_node: 1 3
>>>> >
>>>> > This line had the same outcome as using 1 3
>>>> > auto_tie_breaker_node: 1 2 3
>>>> >
>>>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but rather
>>>> sounds dangerous if that configuration is possible at all.
>>>>
>>>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>>>> recognized) misconfiguration.
>>>> Did you wait long enough between letting the 2 nodes fail?
>>>>
>>> I've done it so many times so I believe so. But I'll try remove the
>>> auto_tie_breaker config, leaving the last_man_standing.
>>> I'll also make sure
>>> I leave a couple of minutes between bringing down the nodes and post back.
>>>
>> Just confirming I removed the auto_tie_breaker config and tested. Quorum
>> configuration is as follows:
>> Options:
>>   last_man_standing: 1
>>   last_man_standing_window: 10000
>>   wait_for_all: 1
>>
>> I waited 2-3 minutes between stopping cluster services on two nodes via
>> pcs cluster stop
>> The remaining cluster node is then fenced. I was hoping the remaining
>> node would stay online running the resources.
>>
> Yep - that would've been my understanding as well.
> But honestly I've never used last_man_standing in this context - wasn't
> even aware that it was offered without qdevice nor have I checked how it
> is implemented.
>
> Klaus
>
>>>> Klaus
>>>>
>>>> > So I'd like it to failover when any combination of two nodes fail but I've
>>>> > only had success when the middle node isn't last.
>>>> >
>>>> > Thanks
>>>> > David
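For completeness, while last_man_standing and wait_for_all were still enabled, the
quorum section of /etc/corosync/corosync.conf would have looked roughly like this. This
is a sketch matching the pcs quorum config output quoted above, not a paste of the
exact file:

    quorum {
        provider: corosync_votequorum
        wait_for_all: 1
        last_man_standing: 1
        last_man_standing_window: 10000
    }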
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/