I have seen this behavior on several virtualised environments: when the VM backup starts, the VM actually freezes for a (short?) period of time. I guess it then stops responding to the other cluster nodes, triggering an unexpected failover and/or fencing. I have seen this kind of behavior on a VMware environment using Veeam backup, as well as on Proxmox (I don't know which backup tool). That's actually an interesting topic I never thought about raising here. How can we avoid it? By increasing the timeout? I am afraid we would have to reach unacceptably high timeout values, and I am not even sure that would fix the problem. I think not all VM snapshot strategies would trigger this. Do you have any feedback on which backup/snapshot method best suits corosync clusters?
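To sketch the timeout approach: on a CMAN-based cluster like the one in the logs below, the totem token timeout can be raised in /etc/cluster/cluster.conf. This is only an illustrative fragment, not a recommendation — the cluster name, config_version, and the 30000 ms value are placeholders to be adapted:

```xml
<?xml version="1.0"?>
<!-- Illustrative fragment only; cluster name and config_version are placeholders. -->
<cluster name="example" config_version="2">
  <!-- Raise the totem token timeout (in milliseconds) so that a short
       backup-induced VM freeze does not get the node declared dead.
       30000 ms would tolerate freezes of up to roughly 30 s, at the cost
       of detecting genuine node failures that much more slowly. -->
  <totem token="30000"/>
  <!-- clusternodes, fencing, etc. unchanged -->
</cluster>
```

After editing you would bump config_version and propagate the change (e.g. with `cman_tool version -r`). The trade-off is exactly the one raised above: every second of freeze you tolerate is a second added to real failure detection, which is why very long backup stalls may not be fixable by timeouts alone.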
Regards

On 9 Oct 2014 01:24, "Alex Samad - Yieldbroker" <alex.sa...@yieldbroker.com> wrote:
> One of my nodes died in a 2 node cluster.
>
> I gather something went wrong and it fenced/killed itself, but I am not sure what happened.
>
> I think maybe around that time the VM backups happened and a snap of the VM could have happened.
>
> But there is nothing for me to put my finger on.
>
> Output from messages around that time.
>
> This is on devrp1:
> Oct 8 23:31:38 devrp1 corosync[1670]: [TOTEM ] A processor failed, forming new configuration.
> Oct 8 23:31:40 devrp1 corosync[1670]: [CMAN ] quorum lost, blocking activity
> Oct 8 23:31:40 devrp1 corosync[1670]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Oct 8 23:31:40 devrp1 corosync[1670]: [QUORUM] Members[1]: 1
> Oct 8 23:31:40 devrp1 corosync[1670]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct 8 23:31:40 devrp1 corosync[1670]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:2 left:1)
> Oct 8 23:31:40 devrp1 corosync[1670]: [MAIN ] Completed service synchronization, ready to provide service.
> Oct 8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
> Oct 8 23:31:42 devrp1 crmd[2350]: notice: cman_event_callback: Membership 424: quorum lost
> Oct 8 23:31:42 devrp1 corosync[1670]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct 8 23:31:42 devrp1 corosync[1670]: [CMAN ] quorum regained, resuming activity
> Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] This node is within the primary component and will provide service.
> Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] Members[2]: 1 2
> Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] Members[2]: 1 2
> Oct 8 23:31:42 devrp1 corosync[1670]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:1 left:0)
> Oct 8 23:31:42 devrp1 corosync[1670]: [MAIN ] Completed service synchronization, ready to provide service.
> Oct 8 23:31:42 devrp1 crmd[2350]: notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now lost (was member)
> Oct 8 23:31:42 devrp1 crmd[2350]: warning: reap_dead_nodes: Our DC node (devrp2) left the cluster
> Oct 8 23:31:42 devrp1 crmd[2350]: notice: cman_event_callback: Membership 428: quorum acquired
> Oct 8 23:31:42 devrp1 crmd[2350]: notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now member (was lost)
> Oct 8 23:31:42 devrp1 crmd[2350]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
> Oct 8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we were killed by cman_tool or other application
> Oct 8 23:31:42 devrp1 pacemakerd[2339]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct 8 23:31:42 devrp1 stonith-ng[2346]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct 8 23:31:42 devrp1 crmd[2350]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct 8 23:31:42 devrp1 crmd[2350]: error: crmd_cs_destroy: connection terminated
> Oct 8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
> Oct 8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
> Oct 8 23:31:43 devrp1 attrd[2348]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct 8 23:31:43 devrp1 attrd[2348]: crit: attrd_cs_destroy: Lost connection to Corosync service!
> Oct 8 23:31:43 devrp1 attrd[2348]: notice: main: Exiting...
> Oct 8 23:31:43 devrp1 attrd[2348]: notice: main: Disconnecting client 0x18cf240, pid=2350...
> Oct 8 23:31:43 devrp1 pacemakerd[2339]: error: mcp_cpg_destroy: Connection destroyed
> Oct 8 23:31:43 devrp1 cib[2345]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
> Oct 8 23:31:43 devrp1 cib[2345]: error: cib_cs_destroy: Corosync connection lost! Exiting.
> Oct 8 23:31:43 devrp1 stonith-ng[2346]: error: stonith_peer_cs_destroy: Corosync connection terminated
> Oct 8 23:31:43 devrp1 dlm_controld[1752]: cluster is down, exiting
> Oct 8 23:31:43 devrp1 dlm_controld[1752]: daemon cpg_dispatch error 2
> Oct 8 23:31:43 devrp1 gfs_controld[1801]: cluster is down, exiting
> Oct 8 23:31:43 devrp1 crmd[2350]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
> Oct 8 23:31:43 devrp1 attrd[2348]: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
> Oct 8 23:31:43 devrp1 lrmd[2347]: warning: qb_ipcs_event_sendv: new_event_notification (2347-2350-6): Bad file descriptor (9)
> Oct 8 23:31:43 devrp1 lrmd[2347]: warning: send_client_notify: Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
> Oct 8 23:31:43 devrp1 lrmd[2347]: warning: send_client_notify: Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
>
> Devrp2:
> Oct 8 23:31:26 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct 8 23:31:42 devrp2 crmd[2361]: notice: cman_event_callback: Membership 428: quorum lost
> Oct 8 23:31:43 devrp2 fenced[1747]: telling cman to remove nodeid 1 from cluster
> Oct 8 23:31:45 devrp2 corosync[1685]: [TOTEM ] A processor failed, forming new configuration.
> Oct 8 23:31:45 devrp2 corosync[1685]: [CMAN ] quorum lost, blocking activity
> Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] Members[1]: 2
> Oct 8 23:31:45 devrp2 corosync[1685]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct 8 23:31:45 devrp2 corosync[1685]: [CMAN ] quorum regained, resuming activity
> Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] This node is within the primary component and will provide service.
> Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] Members[2]: 1 2
> Oct 8 23:31:45 devrp2 corosync[1685]: [QUORUM] Members[2]: 1 2
> Oct 8 23:31:45 devrp2 corosync[1685]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.52) ; members(old:2 left:1)
> Oct 8 23:31:45 devrp2 corosync[1685]: [MAIN ] Completed service synchronization, ready to provide service.
> Oct 8 23:31:45 devrp2 fenced[1747]: receive_start 1:3 add node with started_count 1
> Oct 8 23:31:45 devrp2 crmd[2361]: notice: cman_event_callback: Membership 428: quorum acquired
> Oct 8 23:31:45 devrp2 crmd[2361]: notice: crm_update_peer_state: cman_event_callback: Node devrp1[1] - state is now lost (was member)
> Oct 8 23:31:45 devrp2 crmd[2361]: warning: match_down_event: No match for shutdown action on devrp1
> Oct 8 23:31:45 devrp2 crmd[2361]: notice: peer_update_callback: Stonith/shutdown of devrp1 not matched
> Oct 8 23:31:49 devrp2 crmd[2361]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> Oct 8 23:31:49 devrp2 crmd[2361]: warning: match_down_event: No match for shutdown action on devrp1
> Oct 8 23:31:49 devrp2 crmd[2361]: notice: peer_update_callback: Stonith/shutdown of devrp1 not matched
> Oct 8 23:31:49 devrp2 crmd[2361]: notice: do_election_count_vote: Election 2 (current: 2, owner: devrp1): Processed vote from devrp1 (Peer is not part of our cluster)
> Oct 8 23:31:49 devrp2 pengine[2360]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 8 23:31:49 devrp2 kernel: dlm: closing connection to node 1
> Oct 8 23:31:49 devrp2 attrd[2359]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Oct 8 23:31:49 devrp2 attrd[2359]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1412658023)
> Oct 8 23:31:49 devrp2 dlm_controld[1773]: /sys/kernel/config/dlm/cluster/comms/1: rmdir failed: 2
> Oct 8 23:31:49 devrp2 attrd[2359]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Oct 8 23:31:49 devrp2 pengine[2360]: notice: LogActions: Start ybrpip#011(devrp2)
> Oct 8 23:31:49 devrp2 pengine[2360]: notice: process_pe_message: Calculated Transition 1061: /var/lib/pacemaker/pengine/pe-input-2282.bz2
> Oct 8 23:31:50 devrp2 pengine[2360]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 8 23:31:50 devrp2 pengine[2360]: notice: LogActions: Start ybrpip#011(devrp2)
> Oct 8 23:31:50 devrp2 pengine[2360]: notice: process_pe_message: Calculated Transition 1062: /var/lib/pacemaker/pengine/pe-input-2283.bz2
> Oct 8 23:31:50 devrp2 crmd[2361]: notice: te_rsc_command: Initiating action 5: start ybrpip_start_0 on devrp2 (local)
> Oct 8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Adding inet address 10.172.214.50/24 with broadcast address 10.172.214.255 to device eth0
> Oct 8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Bringing device eth0 up
> Oct 8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.172.214.50 eth0 10.172.214.50 auto not_used not_used
> Oct 8 23:31:51 devrp2 crmd[2361]: notice: process_lrm_event: LRM operation ybrpip_start_0 (call=995, rc=0, cib-update=1494, confirmed=true) ok
> Oct 8 23:31:51 devrp2 crmd[2361]: notice: te_rsc_command: Initiating action 6: monitor ybrpip_monitor_5000 on devrp2 (local)
> Oct 8 23:31:51 devrp2 crmd[2361]: notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=998, rc=0, cib-update=1495, confirmed=false) ok
> Oct 8 23:31:51 devrp2 crmd[2361]: notice: run_graph: Transition 1062 (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2283.bz2): Complete
> Oct 8 23:31:51 devrp2 crmd[2361]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 8 23:31:52 devrp2 ntpd[2099]: Listen normally on 6 eth0 10.172.214.50 UDP 123
> Oct 8 23:31:52 devrp2 ntpd[2099]: peers refreshed
> Oct 8 23:31:52 devrp2 corosync[1685]: [TOTEM ] A processor failed, forming new configuration.
> Oct 8 23:31:54 devrp2 corosync[1685]: [CMAN ] quorum lost, blocking activity
> Oct 8 23:31:54 devrp2 corosync[1685]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Oct 8 23:31:54 devrp2 corosync[1685]: [QUORUM] Members[1]: 2
> Oct 8 23:31:54 devrp2 corosync[1685]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct 8 23:31:54 devrp2 crmd[2361]: notice: cman_event_callback: Membership 432: quorum lost
> Oct 8 23:31:54 devrp2 corosync[1685]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.52) ; members(old:2 left:1)
> Oct 8 23:31:54 devrp2 corosync[1685]: [MAIN ] Completed service synchronization, ready to provide service.
> Oct 8 23:31:54 devrp2 crmd[2361]: warning: match_down_event: No match for shutdown action on devrp1
> Oct 8 23:31:54 devrp2 crmd[2361]: notice: peer_update_callback: Stonith/shutdown of devrp1 not matched
> Oct 8 23:31:54 devrp2 crmd[2361]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> Oct 8 23:31:54 devrp2 attrd[2359]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> Oct 8 23:31:54 devrp2 attrd[2359]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1412658023)
> Oct 8 23:31:55 devrp2 attrd[2359]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> Oct 8 23:31:55 devrp2 pengine[2360]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 8 23:31:55 devrp2 pengine[2360]: notice: process_pe_message: Calculated Transition 1063: /var/lib/pacemaker/pengine/pe-input-2284.bz2
> Oct 8 23:31:55 devrp2 crmd[2361]: notice: run_graph: Transition 1063 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete
> Oct 8 23:31:55 devrp2 crmd[2361]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 8 23:31:59 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct 8 23:36:52 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct 8 23:46:55 devrp2 crmd[2361]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Oct 8 23:46:55 devrp2 pengine[2360]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 8 23:46:55 devrp2 pengine[2360]: notice: process_pe_message: Calculated Transition 1064: /var/lib/pacemaker/pengine/pe-input-2284.bz2
> Oct 8 23:46:55 devrp2 crmd[2361]: notice: run_graph: Transition 1064 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete
> Oct 8 23:46:55 devrp2 crmd[2361]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>