Hello,

Our team has been using corosync + pacemaker successfully for the last year or two, but last week we ran into an issue that I wanted to get some more insight on. We have a 2-node cluster using the WaitForAll votequorum parameter, so all nodes must have been seen at least once before resources are started. We have two layers of fencing configured, IPMI and SBD (storage-based death, using shared storage). We have done extensive testing on our fencing in the past and it works great, but here the fencing was never called.

One of our QA testers managed to pull the network cable at a very particular time during startup, and it seems to have resulted in corosync telling pacemaker that all nodes had been seen and that the cluster was in a normal state with one node up. No fencing was ever triggered, and all resources were started normally. The other node was NOT marked unclean. This resulted in a split-brain scenario, as our master database (pgsql replication) was still running as master on the other node and had now been started and promoted on this node. Luckily this is all in a test environment, so no production impact was seen.

Below are the test specifics and some relevant logs.
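For context, the votequorum setup looks roughly like this. This is a minimal sketch of the relevant corosync.conf section, not copied verbatim from our config (nodelist and totem sections trimmed); as I understand it, two_node implies wait_for_all, but we set both explicitly:

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    wait_for_all: 1
}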
Procedure:
1. Allow both nodes to come up fully.
2. Reboot the current master node.
3. As the node is booting up again (during corosync startup), pull the interconnect cable.

Expected behavior:
1. Node either a) fails to start any resources, or b) fences the other node and promotes to master.

Actual behavior:
1. Node promotes to master without fencing its peer, resulting in both nodes running the master database.

Module-2 is rebooted @ 12:57:42 and comes back up ~12:59. When corosync starts up, both nodes are visible and all vote counts are normal.

Jul 15 12:59:00 module-2 corosync[2906]: [SERV ] Service engine loaded: corosync vote quorum service v1.0 [5]
Jul 15 12:59:00 module-2 corosync[2906]: [TOTEM ] A new membership (10.1.1.2:56) was formed. Members joined: 2
Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Waiting for all cluster members. Current votes: 1 expected_votes: 2
Jul 15 12:59:00 module-2 corosync[2906]: [QUORUM] Members[1]: 2
Jul 15 12:59:00 module-2 corosync[2906]: [MAIN ] Completed service synchronization, ready to provide service.
Jul 15 12:59:06 module-2 pacemakerd[4076]: notice: cluster_connect_quorum: Quorum acquired

Three seconds later, the interconnect network cable is pulled.

Jul 15 12:59:09 module-2 kernel: e1000e: eth3 NIC Link is Down

Corosync recognizes this immediately and declares the peer dead.

Jul 15 12:59:10 module-2 crmd[4107]: notice: peer_update_callback: Our peer on the DC (module-1) is dead

Very shortly afterward, corosync initialization completes; it reports that it has quorum and declares the system ready for use.

Jul 15 12:59:10 module-2 corosync[2906]: [QUORUM] Members[1]: 2
Jul 15 12:59:10 module-2 corosync[2906]: [MAIN ] Completed service synchronization, ready to provide service.

Pacemaker starts resources normally, including Postgres.

Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start fence_sbd (module-2)
Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start ipmi-1 (module-2)
Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start SlaveIP (module-2)
Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start postgres:0 (module-2)
Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start ethmonitor:0 (module-2)
Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start tomcat-instance:0 (module-2 - blocked)
Jul 15 12:59:13 module-2 pengine[4106]: notice: LogActions: Start ClusterMonitor:0 (module-2 - blocked)

Votequorum shows 1 vote per node and WaitForAll is set. Pacemaker should not be able to start ANY resources until all nodes have been seen at least once.

module-2 ~ # corosync-quorumtool
Quorum information
------------------
Date:             Wed Jul 15 18:15:34 2015
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          2
Ring ID:          64
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   2
Highest expected: 2
Total votes:      1
Quorum:           1
Flags:            2Node Quorate WaitForAll

Membership information
----------------------
    Nodeid      Votes Name
         2          1 module-2 (local)

Package versions:

-bash-4.3# rpm -qa | grep corosync
corosynclib-2.3.4-1.fc22.x86_64
corosync-2.3.4-1.fc22.x86_64
-bash-4.3# rpm -qa | grep pacemaker
pacemaker-cluster-libs-1.1.12-2.fc22.x86_64
pacemaker-libs-1.1.12-2.fc22.x86_64
pacemaker-cli-1.1.12-2.fc22.x86_64
pacemaker-1.1.12-2.fc22.x86_64
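As a stopgap while we investigate, we are considering gating pacemaker startup on the peer actually being in the membership, by polling the corosync-quorumtool output before pacemaker is started. A rough sketch only; the script name and how it would be wired in front of pacemaker are hypothetical, and the field it greps for is the "Nodes:" line shown in the output above:

#!/bin/bash
# wait-for-all-members.sh -- hypothetical helper, run before starting pacemaker.
# Polls corosync-quorumtool until the number of member nodes matches the
# expected node count, then exits 0 so pacemaker can be started.
expected=2
for i in $(seq 1 60); do
    nodes=$(corosync-quorumtool 2>/dev/null | awk '/^Nodes:/ {print $2}')
    if [ "$nodes" = "$expected" ]; then
        exit 0
    fi
    sleep 2
done
echo "Timed out waiting for all $expected members" >&2
exit 1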