Hello,

I have a 3-node corosync and pacemaker cluster. The nodes are:

Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]
Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
     Masters: [ SG-azfw2-189 ]
     Slaves: [ SG-azfw2-190 SG-azfw2-191 ]

For my network partition test, I created a firewall rule on node SG-azfw2-190 to block all incoming UDP traffic from node SG-azfw2-189:

/sbin/iptables -I INPUT -p udp -s 172.19.0.13 -j DROP

I don't think corosync is detecting the partition correctly, because different nodes report different membership information.

On node SG-azfw2-189, I still see the members as:

Online: [ SG-azfw2-189 SG-azfw2-190 SG-azfw2-191 ]

Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
     Masters: [ SG-azfw2-189 ]
     Slaves: [ SG-azfw2-190 SG-azfw2-191 ]

whereas on node SG-azfw2-190, I see the membership as:

Online: [ SG-azfw2-190 SG-azfw2-191 ]
OFFLINE: [ SG-azfw2-189 ]

Full list of resources:

 Master/Slave Set: ms_mysql [p_mysql]
     Slaves: [ SG-azfw2-190 SG-azfw2-191 ]
     Stopped: [ SG-azfw2-189 ]

I expected node SG-azfw2-189 to detect that the other 2 nodes had left. Instead, in the corosync logs for this node, I continuously see the messages below:

Apr 30 11:00:03 corosync [TOTEM ] entering GATHER state from 4.
Apr 30 11:00:03 corosync [TOTEM ] Creating commit token because I am the rep.
Apr 30 11:00:03 corosync [MAIN ] Storing new sequence id for ring 2e64
Apr 30 11:00:03 corosync [TOTEM ] entering COMMIT state.
Apr 30 11:00:33 corosync [TOTEM ] The token was lost in the COMMIT state.
Apr 30 11:00:33 corosync [TOTEM ] entering GATHER state from 4.
Apr 30 11:00:33 corosync [TOTEM ] Creating commit token because I am the rep.
Apr 30 11:00:33 corosync [MAIN ] Storing new sequence id for ring 2e68
Apr 30 11:00:33 corosync [TOTEM ] entering COMMIT state.
Apr 30 11:01:03 corosync [TOTEM ] The token was lost in the COMMIT state.

On the other nodes, I see messages like:

notice: pcmk_peer_update: Transitional membership event on ring 11888: memb=2, new=0, lost=0
Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: SG-azfw2-190 301994924
Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: memb: SG-azfw2-191 603984812
Apr 30 11:06:10 corosync [TOTEM ] waiting_trans_ack changed to 1
Apr 30 11:06:10 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 11888: memb=2, new=0, lost=0
Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: SG-azfw2-190 301994924
Apr 30 11:06:10 corosync [pcmk ] info: pcmk_peer_update: MEMB: SG-azfw2-191 603984812
Apr 30 11:06:10 corosync [SYNC ] This node is within the primary component and will provide service.
Apr 30 11:06:10 corosync [TOTEM ] entering OPERATIONAL state.

Could the corosync experts please advise on the probable root cause, or suggest ways to debug this further? Any help is much appreciated.

corosync version: 1.4.8
pacemaker version: 1.1.14-8.el6_8.1

Thanks!
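
P.S. In case it helps anyone reproduce the test: the iptables rule above only drops UDP arriving on SG-azfw2-190 from SG-azfw2-189. Traffic in the other direction, and everything to or from SG-azfw2-191, is left untouched. Below is a minimal sketch of a dry-run helper that prints the rule I used, plus a possible outgoing counterpart and the matching delete commands. The function name partition_rules and its "oneway"/"twoway" modes are made up for illustration and are not part of iptables or corosync. I have not verified whether a two-way block changes the behaviour described above.

```shell
#!/bin/sh
# Dry-run helper: it only PRINTS iptables commands, it does not run them.
# PEER_IP is the address from the test above (node SG-azfw2-189).
PEER_IP="172.19.0.13"

# Hypothetical helper (not part of iptables or corosync).
#   $1 = iptables action flag: -I to insert the rule, -D to delete it
#   $2 = IP address of the peer to cut off
#   $3 = "oneway" (incoming only, as in the original test) or "twoway"
partition_rules() {
    action="$1"
    peer="$2"
    mode="${3:-oneway}"

    # Rule from the original test: drop UDP arriving from the peer.
    printf '%s\n' "/sbin/iptables $action INPUT -p udp -s $peer -j DROP"

    # Assumed counterpart: also drop UDP sent to the peer.
    if [ "$mode" = "twoway" ]; then
        printf '%s\n' "/sbin/iptables $action OUTPUT -p udp -d $peer -j DROP"
    fi
}

# Print the rule used in the test, then the two-way variant,
# then the commands that would remove the two-way variant again.
partition_rules -I "$PEER_IP" oneway
partition_rules -I "$PEER_IP" twoway
partition_rules -D "$PEER_IP" twoway
```

The printed lines can be reviewed and then run as root on SG-azfw2-190. The -D variants remove the same rules once the test is finished.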