On 9 Oct 2014, at 5:06 pm, Alexandre <alxg...@gmail.com> wrote:

> I have seen this behavior on several virtualised environments: when the VM 
> backup starts, the VM actually freezes for a (short?) period of time. I guess 
> it then stops responding to the other cluster nodes, thus triggering 
> unexpected failover and/or fencing.

Alas, the dlm is _really_ intolerant of any membership blips.
Once a node is marked failed, the dlm wants it fenced, even if it comes back 
1ms later.
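
That said, if the freeze really is only a few seconds, the knob people usually 
reach for on a cman-based cluster (which is what the logs below show) is the 
totem token timeout in /etc/cluster/cluster.conf. A minimal sketch, with an 
illustrative value and assuming I remember the cman schema correctly:

    <!-- inside the existing <cluster> element of /etc/cluster/cluster.conf -->
    <!-- token is in milliseconds; 20000 here is illustrative, not a
         recommendation -->
    <totem token="20000"/>

Bump config_version and push it out (cman_tool version -r, if I remember 
right). The trade-off is that every genuine failure now also takes that long 
to detect, and a stall that outlasts the token timeout still ends in a new 
membership and fencing.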

> I have this kind of behavior on VMware environments using Veeam backup, as 
> well as on Proxmox (and I don't know what backup tool it uses).
> That's actually an interesting topic I never thought about raising here.
> How can we avoid that? Increasing timeouts? I am afraid we would have to 
> reach unacceptably high timeout values, and I am not even sure that would 
> fix the problem.
> I think not all VM snapshot strategies would trigger that problem; do you 
> guys have any feedback on which backup/snapshot method best suits corosync 
> clusters?
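
On a plain corosync 2.x stack (no cman), the equivalent setting lives in the 
totem section of /etc/corosync/corosync.conf. Again a sketch with an 
illustrative value, not a recommendation:

    totem {
        version: 2
        # Token timeout in milliseconds: how long corosync waits before it
        # declares the token lost and forms a new membership. 20000 is only
        # an example; size it to the longest stall you expect to ride out.
        token: 20000
    }

IIRC the consensus timeout is derived from token (1.2 x token) when you do not 
set it explicitly, so raising token alone is usually enough. But as you say, 
this only papers over short freezes; anything longer than the token timeout 
still triggers a membership change, so a backup/snapshot method that does not 
stun the guest is the cleaner fix.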
> 
> Regards
> 
> On 9 Oct 2014 at 01:24, "Alex Samad - Yieldbroker" 
> <alex.sa...@yieldbroker.com> wrote:
> One of my nodes died in a 2-node cluster.
> 
> I gather something went wrong, and it fenced/killed itself. But I am not sure 
> what happened.
> 
> I think the VM backups may have run around that time, and a snapshot of the 
> VM could have happened.
> 
> But there is nothing I can put my finger on.
> 
> Output from messages around that time:
> 
> This is on devrp1:
> Oct  8 23:31:38 devrp1 corosync[1670]:   [TOTEM ] A processor failed, forming 
> new configuration.
> Oct  8 23:31:40 devrp1 corosync[1670]:   [CMAN  ] quorum lost, blocking 
> activity
> Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] This node is within the 
> non-primary component and will NOT provide any services.
> Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] Members[1]: 1
> Oct  8 23:31:40 devrp1 corosync[1670]:   [TOTEM ] A processor joined or left 
> the membership and a new membership was formed.
> Oct  8 23:31:40 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender 
> r(0) ip(10.172.214.51) ; members(old:2 left:1)
> Oct  8 23:31:40 devrp1 corosync[1670]:   [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Oct  8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback: Membership 
> 424: quorum lost
> Oct  8 23:31:42 devrp1 corosync[1670]:   [TOTEM ] A processor joined or left 
> the membership and a new membership was formed.
> Oct  8 23:31:42 devrp1 corosync[1670]:   [CMAN  ] quorum regained, resuming 
> activity
> Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] This node is within the 
> primary component and will provide service.
> Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:42 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender 
> r(0) ip(10.172.214.51) ; members(old:1 left:0)
> Oct  8 23:31:42 devrp1 corosync[1670]:   [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state: 
> cman_event_callback: Node devrp2[2] - state is now lost (was member)
> Oct  8 23:31:42 devrp1 crmd[2350]:  warning: reap_dead_nodes: Our DC node 
> (devrp2) left the cluster
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback: Membership 
> 428: quorum acquired
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state: 
> cman_event_callback: Node devrp2[2] - state is now member (was lost)
> Oct  8 23:31:42 devrp1 crmd[2350]:   notice: do_state_transition: State 
> transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL 
> origin=reap_dead_nodes ]
> Oct  8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we were 
> killed by cman_tool or other application
> Oct  8 23:31:42 devrp1 pacemakerd[2339]:    error: pcmk_cpg_dispatch: 
> Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:42 devrp1 stonith-ng[2346]:    error: pcmk_cpg_dispatch: 
> Connection to the CPG API failed: Library error (2)
> Oct  8 23:31:42 devrp1 crmd[2350]:    error: pcmk_cpg_dispatch: Connection to 
> the CPG API failed: Library error (2)
> Oct  8 23:31:42 devrp1 crmd[2350]:    error: crmd_cs_destroy: connection 
> terminated
> Oct  8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
> Oct  8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
> Oct  8 23:31:43 devrp1 attrd[2348]:    error: pcmk_cpg_dispatch: Connection 
> to the CPG API failed: Library error (2)
> Oct  8 23:31:43 devrp1 attrd[2348]:     crit: attrd_cs_destroy: Lost 
> connection to Corosync service!
> Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Exiting...
> Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Disconnecting client 
> 0x18cf240, pid=2350...
> Oct  8 23:31:43 devrp1 pacemakerd[2339]:    error: mcp_cpg_destroy: 
> Connection destroyed
> Oct  8 23:31:43 devrp1 cib[2345]:    error: pcmk_cpg_dispatch: Connection to 
> the CPG API failed: Library error (2)
> Oct  8 23:31:43 devrp1 cib[2345]:    error: cib_cs_destroy: Corosync 
> connection lost!  Exiting.
> Oct  8 23:31:43 devrp1 stonith-ng[2346]:    error: stonith_peer_cs_destroy: 
> Corosync connection terminated
> Oct  8 23:31:43 devrp1 dlm_controld[1752]: cluster is down, exiting
> Oct  8 23:31:43 devrp1 dlm_controld[1752]: daemon cpg_dispatch error 2
> Oct  8 23:31:43 devrp1 gfs_controld[1801]: cluster is down, exiting
> Oct  8 23:31:43 devrp1 crmd[2350]:   notice: crmd_exit: Forcing immediate 
> exit: Link has been severed (67)
> Oct  8 23:31:43 devrp1 attrd[2348]:    error: attrd_cib_connection_destroy: 
> Connection to the CIB terminated...
> Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: qb_ipcs_event_sendv: 
> new_event_notification (2347-2350-6): Bad file descriptor (9)
> Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: send_client_notify: Notification 
> of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
> Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: send_client_notify: Notification 
> of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
> 
> 
> This is on devrp2:
> Oct  8 23:31:26 devrp2 kernel: IN=eth0 OUT= 
> MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 
> LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct  8 23:31:42 devrp2 crmd[2361]:   notice: cman_event_callback: Membership 
> 428: quorum lost
> Oct  8 23:31:43 devrp2 fenced[1747]: telling cman to remove nodeid 1 from 
> cluster
> Oct  8 23:31:45 devrp2 corosync[1685]:   [TOTEM ] A processor failed, forming 
> new configuration.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [CMAN  ] quorum lost, blocking 
> activity
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] This node is within the 
> non-primary component and will NOT provide any services.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] Members[1]: 2
> Oct  8 23:31:45 devrp2 corosync[1685]:   [TOTEM ] A processor joined or left 
> the membership and a new membership was formed.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [CMAN  ] quorum regained, resuming 
> activity
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] This node is within the 
> primary component and will provide service.
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:45 devrp2 corosync[1685]:   [QUORUM] Members[2]: 1 2
> Oct  8 23:31:45 devrp2 corosync[1685]:   [CPG   ] chosen downlist: sender 
> r(0) ip(10.172.214.52) ; members(old:2 left:1)
> Oct  8 23:31:45 devrp2 corosync[1685]:   [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Oct  8 23:31:45 devrp2 fenced[1747]: receive_start 1:3 add node with 
> started_count 1
> Oct  8 23:31:45 devrp2 crmd[2361]:   notice: cman_event_callback: Membership 
> 428: quorum acquired
> Oct  8 23:31:45 devrp2 crmd[2361]:   notice: crm_update_peer_state: 
> cman_event_callback: Node devrp1[1] - state is now lost (was member)
> Oct  8 23:31:45 devrp2 crmd[2361]:  warning: match_down_event: No match for 
> shutdown action on devrp1
> Oct  8 23:31:45 devrp2 crmd[2361]:   notice: peer_update_callback: 
> Stonith/shutdown of devrp1 not matched
> Oct  8 23:31:49 devrp2 crmd[2361]:   notice: do_state_transition: State 
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL 
> origin=check_join_state ]
> Oct  8 23:31:49 devrp2 crmd[2361]:  warning: match_down_event: No match for 
> shutdown action on devrp1
> Oct  8 23:31:49 devrp2 crmd[2361]:   notice: peer_update_callback: 
> Stonith/shutdown of devrp1 not matched
> Oct  8 23:31:49 devrp2 crmd[2361]:   notice: do_election_count_vote: Election 
> 2 (current: 2, owner: devrp1): Processed vote from devrp1 (Peer is not part 
> of our cluster)
> Oct  8 23:31:49 devrp2 pengine[2360]:   notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Oct  8 23:31:49 devrp2 kernel: dlm: closing connection to node 1
> Oct  8 23:31:49 devrp2 attrd[2359]:   notice: attrd_local_callback: Sending 
> full refresh (origin=crmd)
> Oct  8 23:31:49 devrp2 attrd[2359]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts for: last-failure-ybrpstat (1412658023)
> Oct  8 23:31:49 devrp2 dlm_controld[1773]: 
> /sys/kernel/config/dlm/cluster/comms/1: rmdir failed: 2
> Oct  8 23:31:49 devrp2 attrd[2359]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts for: probe_complete (true)
> Oct  8 23:31:49 devrp2 pengine[2360]:   notice: LogActions: Start   
> ybrpip#011(devrp2)
> Oct  8 23:31:49 devrp2 pengine[2360]:   notice: process_pe_message: 
> Calculated Transition 1061: /var/lib/pacemaker/pengine/pe-input-2282.bz2
> Oct  8 23:31:50 devrp2 pengine[2360]:   notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Oct  8 23:31:50 devrp2 pengine[2360]:   notice: LogActions: Start   
> ybrpip#011(devrp2)
> Oct  8 23:31:50 devrp2 pengine[2360]:   notice: process_pe_message: 
> Calculated Transition 1062: /var/lib/pacemaker/pengine/pe-input-2283.bz2
> Oct  8 23:31:50 devrp2 crmd[2361]:   notice: te_rsc_command: Initiating 
> action 5: start ybrpip_start_0 on devrp2 (local)
> Oct  8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Adding inet address 
> 10.172.214.50/24 with broadcast address 10.172.214.255 to device eth0
> Oct  8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: Bringing device eth0 up
> Oct  8 23:31:51 devrp2 IPaddr2(ybrpip)[30442]: INFO: 
> /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p 
> /var/run/resource-agents/send_arp-10.172.214.50 eth0 10.172.214.50 auto 
> not_used not_used
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: process_lrm_event: LRM operation 
> ybrpip_start_0 (call=995, rc=0, cib-update=1494, confirmed=true) ok
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: te_rsc_command: Initiating 
> action 6: monitor ybrpip_monitor_5000 on devrp2 (local)
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: process_lrm_event: LRM operation 
> ybrpip_monitor_5000 (call=998, rc=0, cib-update=1495, confirmed=false) ok
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: run_graph: Transition 1062 
> (Complete=2, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-2283.bz2): Complete
> Oct  8 23:31:51 devrp2 crmd[2361]:   notice: do_state_transition: State 
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct  8 23:31:52 devrp2 ntpd[2099]: Listen normally on 6 eth0 10.172.214.50 
> UDP 123
> Oct  8 23:31:52 devrp2 ntpd[2099]: peers refreshed
> Oct  8 23:31:52 devrp2 corosync[1685]:   [TOTEM ] A processor failed, forming 
> new configuration.
> Oct  8 23:31:54 devrp2 corosync[1685]:   [CMAN  ] quorum lost, blocking 
> activity
> Oct  8 23:31:54 devrp2 corosync[1685]:   [QUORUM] This node is within the 
> non-primary component and will NOT provide any services.
> Oct  8 23:31:54 devrp2 corosync[1685]:   [QUORUM] Members[1]: 2
> Oct  8 23:31:54 devrp2 corosync[1685]:   [TOTEM ] A processor joined or left 
> the membership and a new membership was formed.
> Oct  8 23:31:54 devrp2 crmd[2361]:   notice: cman_event_callback: Membership 
> 432: quorum lost
> Oct  8 23:31:54 devrp2 corosync[1685]:   [CPG   ] chosen downlist: sender 
> r(0) ip(10.172.214.52) ; members(old:2 left:1)
> Oct  8 23:31:54 devrp2 corosync[1685]:   [MAIN  ] Completed service 
> synchronization, ready to provide service.
> Oct  8 23:31:54 devrp2 crmd[2361]:  warning: match_down_event: No match for 
> shutdown action on devrp1
> Oct  8 23:31:54 devrp2 crmd[2361]:   notice: peer_update_callback: 
> Stonith/shutdown of devrp1 not matched
> Oct  8 23:31:54 devrp2 crmd[2361]:   notice: do_state_transition: State 
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL 
> origin=check_join_state ]
> Oct  8 23:31:54 devrp2 attrd[2359]:   notice: attrd_local_callback: Sending 
> full refresh (origin=crmd)
> Oct  8 23:31:54 devrp2 attrd[2359]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts for: last-failure-ybrpstat (1412658023)
> Oct  8 23:31:55 devrp2 attrd[2359]:   notice: attrd_trigger_update: Sending 
> flush op to all hosts for: probe_complete (true)
> Oct  8 23:31:55 devrp2 pengine[2360]:   notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Oct  8 23:31:55 devrp2 pengine[2360]:   notice: process_pe_message: 
> Calculated Transition 1063: /var/lib/pacemaker/pengine/pe-input-2284.bz2
> Oct  8 23:31:55 devrp2 crmd[2361]:   notice: run_graph: Transition 1063 
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete
> Oct  8 23:31:55 devrp2 crmd[2361]:   notice: do_state_transition: State 
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct  8 23:31:59 devrp2 kernel: IN=eth0 OUT= 
> MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 
> LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct  8 23:36:52 devrp2 kernel: IN=eth0 OUT= 
> MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00 SRC=10.0.0.0 DST=224.0.0.1 
> LEN=36 TOS=0x00 PREC=0x00 TTL=1 ID=0 PROTO=2
> Oct  8 23:46:55 devrp2 crmd[2361]:   notice: do_state_transition: State 
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED 
> origin=crm_timer_popped ]
> Oct  8 23:46:55 devrp2 pengine[2360]:   notice: unpack_config: On loss of CCM 
> Quorum: Ignore
> Oct  8 23:46:55 devrp2 pengine[2360]:   notice: process_pe_message: 
> Calculated Transition 1064: /var/lib/pacemaker/pengine/pe-input-2284.bz2
> Oct  8 23:46:55 devrp2 crmd[2361]:   notice: run_graph: Transition 1064 
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, 
> Source=/var/lib/pacemaker/pengine/pe-input-2284.bz2): Complete
> Oct  8 23:46:55 devrp2 crmd[2361]:   notice: do_state_transition: State 
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> 

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
