Hi,

The other day I saw a spurious node loss on my 3-node corosync cluster, with the following logged in the corosync.log of one of the nodes:
Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 32: memb=2, new=0, lost=1
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: memb: vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: lost: vm728316982d 201331884
Aug 18 12:40:25 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 32: memb=2, new=0, lost=0
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vm02d780875f 67114156
Aug 18 12:40:25 corosync [pcmk ] info: pcmk_peer_update: MEMB: vmfa2757171f 151000236
Aug 18 12:40:25 corosync [pcmk ] info: ais_mark_unseen_peer_dead: Node vm728316982d was not seen in the previous transition
Aug 18 12:40:25 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: lost
Aug 18 12:40:25 corosync [pcmk ] info: send_member_notification: Sending membership update 32 to 3 children
Aug 18 12:40:25 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: info: plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: peer_update_callback: vm728316982d is now lost (was member)
Aug 18 12:40:25 [4548] vmfa2757171f crmd: warning: match_down_event: No match for shutdown action on vm728316982d
Aug 18 12:40:25 [4548] vmfa2757171f crmd: notice: peer_update_callback: Stonith/shutdown of vm728316982d not matched
Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: crm_update_peer_join: peer_update_callback: Node vm728316982d[201331884] - join-6 phase 4 -> 0
Aug 18 12:40:25 [4548] vmfa2757171f crmd: info: abort_transition_graph: Transition aborted: Node failure (source=peer_update_callback:240, 1)
Aug 18 12:40:25 [4543] vmfa2757171f cib: info: plugin_handle_membership: Membership 32: quorum retained
Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_update_peer_state_iter: plugin_handle_membership: Node vm728316982d[201331884] - state is now lost (was member)
Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
Aug 18 12:40:25 [4543] vmfa2757171f cib: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: crm_reap_dead_member: Removing vm728316982d/201331884 from the membership list
Aug 18 12:40:25 [4544] vmfa2757171f stonith-ng: notice: reap_crm_member: Purged 1 peers with id=201331884 and/or uname=vm728316982d from the membership cache

However, within seconds, the node rejoined:
Aug 18 12:40:34 corosync [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 36: memb=3, new=1, lost=0
Aug 18 12:40:34 corosync [pcmk ] info: update_member: Node 201331884/vm728316982d is now: member
Aug 18 12:40:34 corosync [pcmk ] info: pcmk_peer_update: NEW: vm728316982d 201331884

But this was enough time for the cluster to get into a split-brain-like situation: a resource on node vm728316982d was stopped because of this node-loss detection.

Could anyone help me understand whether this could be caused by a transient network disruption or something similar? And are there any configuration settings I can apply in corosync.conf to make the cluster more resilient to such temporary disruptions?

Currently my corosync.conf looks like this:

compatibility: whitetank

totem {
    version: 2
    secauth: on
    threads: 0
    interface {
        member {
            memberaddr: 172.20.0.4
        }
        member {
            memberaddr: 172.20.0.9
        }
        member {
            memberaddr: 172.20.0.12
        }
        bindnetaddr: 172.20.0.12
        ringnumber: 0
        mcastport: 5405
        ttl: 1
    }
    transport: udpu
    token: 10000
    token_retransmits_before_loss_const: 10
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: yes
    to_syslog: no
    logfile: /var/log/cluster/corosync.log
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}

service {
    name: pacemaker
    ver: 1
}

amf {
    mode: disabled
}

Thanks in advance for the help.

Prasad
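P.S. To be concrete about the kind of setting I mean: if I read corosync.conf(5) correctly, consensus defaults to 1.2 * token when not set explicitly, so with token: 10000 a failed node should only be declared lost after roughly token + consensus, i.e. around 22 seconds. Would raising these timeouts along the following lines be a sane way to ride out short disruptions? (Just a sketch; the values are my own guesses and not tested.)

totem {
    # ... existing version/secauth/interface/transport settings unchanged ...

    # Guess: double the token timeout so short network hiccups
    # do not trigger a membership change
    token: 20000
    token_retransmits_before_loss_const: 10

    # Guess: set consensus explicitly; corosync.conf(5) says it
    # must be at least 1.2 * token
    consensus: 24000
}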