Re: [Pacemaker] help deciphering output

2014-10-09 Thread Alexandre
I have seen this behavior on several virtualised environments. When a VM
backup starts, the VM actually freezes for a (short?) period of time. I
guess it then stops responding to the other cluster nodes, which triggers
unexpected failover and/or fencing. I have this kind of behavior on a
VMware environment using Veeam backup, as well as on Proxmox (I don't know
which backup tool it uses).
That's actually an interesting topic I never thought about raising here.
How can we avoid that? By increasing timeouts? I am afraid we would have to
reach unacceptably high timeout values, and I am not even sure that would
fix the problem.
I think not every VM snapshot strategy would trigger this problem. Do you
have any feedback on which backup/snapshot method best suits corosync
clusters?
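For reference, the knob usually discussed for this is corosync's totem token
timeout, i.e. how long the other nodes tolerate silence before declaring a
member dead and forming a new membership. A minimal sketch with a purely
illustrative 20-second value (not a tested recommendation), for the two
configuration styles involved here:

  CMAN-based cluster (cluster.conf, as in the logs below), on every node:
      <totem token="20000"/>

  Plain corosync cluster (corosync.conf), on every node:
      totem {
          token: 20000
      }

The trade-off is that a longer token also delays the reaction to genuine node
failures, and it only helps if the backup stun is shorter than the timeout.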

Regards
On 9 Oct 2014 at 01:24, Alex Samad - Yieldbroker alex.sa...@yieldbroker.com
wrote:

 [quoted original message and log output trimmed; see Alex Samad's original
 post below for the full text and log]

Re: [Pacemaker] help deciphering output

2014-10-09 Thread Andrew Beekhof

On 9 Oct 2014, at 5:06 pm, Alexandre alxg...@gmail.com wrote:

 I have seen this behavior on several virtualised environments. When a VM backup
 starts, the VM actually freezes for a (short?) period of time. I guess it then
 stops responding to the other cluster nodes, which triggers unexpected failover
 and/or fencing.

Alas, the dlm is _really_ intolerant of any membership blips.
Once a node is marked failed, the dlm wants it fenced, even if it comes back
1 ms later.
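For anyone trying to confirm that the dlm/fenced path is what drove the
fencing, the cman-era tools can show the fence domain and lockspace state on
the surviving node. A hypothetical inspection sequence (exact output differs
between versions):

  fence_tool ls      # fence domain members, victim count, pending fence actions
  dlm_tool ls        # dlm lockspaces and whether they are blocked on fencing
  grep -E 'fenced|dlm_controld' /var/log/messages   # the daemons log who asked for the fence

If the node shows up as a victim there, the request came from the fence
domain/dlm side rather than from a Pacemaker policy decision.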

 [remainder of quoted message and log output trimmed; see the original post
 below]

[Pacemaker] help deciphering output

2014-10-08 Thread Alex Samad - Yieldbroker
One of my nodes died in a two-node cluster.

I gather something went wrong and it fenced/killed itself, but I am not sure
what happened.

I think the VM backups may have run around that time, and a snapshot of the VM
could have been taken.

But there is nothing for me to put my finger on.
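A possible way to pin that down, assuming a VMware guest with access to the
hypervisor side: the VM's vmware.log in its datastore directory records
snapshot and stun events with timestamps that can be compared against the
cluster log below, and a long stun sometimes leaves a clock-related trace
inside the guest. For example (paths and patterns are illustrative only):

  # hypervisor side: snapshot/stun events for this VM around the incident
  grep -iE 'snapshot|stun' /vmfs/volumes/<datastore>/<vm>/vmware.log

  # guest side: clock jumps or timer complaints right after the freeze
  grep -iE 'clocksource|hrtimer' /var/log/messages | grep 'Oct  8 23:3'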

Output from /var/log/messages around that time:

This is on devrp1
Oct  8 23:31:38 devrp1 corosync[1670]:   [TOTEM ] A processor failed, forming 
new configuration.
Oct  8 23:31:40 devrp1 corosync[1670]:   [CMAN  ] quorum lost, blocking activity
Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] This node is within the 
non-primary component and will NOT provide any services.
Oct  8 23:31:40 devrp1 corosync[1670]:   [QUORUM] Members[1]: 1
Oct  8 23:31:40 devrp1 corosync[1670]:   [TOTEM ] A processor joined or left 
the membership and a new membership was formed.
Oct  8 23:31:40 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender r(0) 
ip(10.172.214.51) ; members(old:2 left:1)
Oct  8 23:31:40 devrp1 corosync[1670]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Oct  8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback: Membership 
424: quorum lost
Oct  8 23:31:42 devrp1 corosync[1670]:   [TOTEM ] A processor joined or left 
the membership and a new membership was formed.
Oct  8 23:31:42 devrp1 corosync[1670]:   [CMAN  ] quorum regained, resuming 
activity
Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] This node is within the 
primary component and will provide service.
Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
Oct  8 23:31:42 devrp1 corosync[1670]:   [QUORUM] Members[2]: 1 2
Oct  8 23:31:42 devrp1 corosync[1670]:   [CPG   ] chosen downlist: sender r(0) 
ip(10.172.214.51) ; members(old:1 left:0)
Oct  8 23:31:42 devrp1 corosync[1670]:   [MAIN  ] Completed service 
synchronization, ready to provide service.
Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state: 
cman_event_callback: Node devrp2[2] - state is now lost (was member)
Oct  8 23:31:42 devrp1 crmd[2350]:  warning: reap_dead_nodes: Our DC node 
(devrp2) left the cluster
Oct  8 23:31:42 devrp1 crmd[2350]:   notice: cman_event_callback: Membership 
428: quorum acquired
Oct  8 23:31:42 devrp1 crmd[2350]:   notice: crm_update_peer_state: 
cman_event_callback: Node devrp2[2] - state is now member (was lost)
Oct  8 23:31:42 devrp1 crmd[2350]:   notice: do_state_transition: State 
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL 
origin=reap_dead_nodes ]
Oct  8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we were 
killed by cman_tool or other application
Oct  8 23:31:42 devrp1 pacemakerd[2339]:error: pcmk_cpg_dispatch: 
Connection to the CPG API failed: Library error (2)
Oct  8 23:31:42 devrp1 stonith-ng[2346]:error: pcmk_cpg_dispatch: 
Connection to the CPG API failed: Library error (2)
Oct  8 23:31:42 devrp1 crmd[2350]:error: pcmk_cpg_dispatch: Connection to 
the CPG API failed: Library error (2)
Oct  8 23:31:42 devrp1 crmd[2350]:error: crmd_cs_destroy: connection 
terminated
Oct  8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
Oct  8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
Oct  8 23:31:43 devrp1 attrd[2348]:error: pcmk_cpg_dispatch: Connection to 
the CPG API failed: Library error (2)
Oct  8 23:31:43 devrp1 attrd[2348]: crit: attrd_cs_destroy: Lost connection 
to Corosync service!
Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Exiting...
Oct  8 23:31:43 devrp1 attrd[2348]:   notice: main: Disconnecting client 
0x18cf240, pid=2350...
Oct  8 23:31:43 devrp1 pacemakerd[2339]:error: mcp_cpg_destroy: Connection 
destroyed
Oct  8 23:31:43 devrp1 cib[2345]:error: pcmk_cpg_dispatch: Connection to 
the CPG API failed: Library error (2)
Oct  8 23:31:43 devrp1 cib[2345]:error: cib_cs_destroy: Corosync connection 
lost!  Exiting.
Oct  8 23:31:43 devrp1 stonith-ng[2346]:error: stonith_peer_cs_destroy: 
Corosync connection terminated
Oct  8 23:31:43 devrp1 dlm_controld[1752]: cluster is down, exiting
Oct  8 23:31:43 devrp1 dlm_controld[1752]: daemon cpg_dispatch error 2
Oct  8 23:31:43 devrp1 gfs_controld[1801]: cluster is down, exiting
Oct  8 23:31:43 devrp1 crmd[2350]:   notice: crmd_exit: Forcing immediate exit: 
Link has been severed (67)
Oct  8 23:31:43 devrp1 attrd[2348]:error: attrd_cib_connection_destroy: 
Connection to the CIB terminated...
Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: qb_ipcs_event_sendv: 
new_event_notification (2347-2350-6): Bad file descriptor (9)
Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: send_client_notify: Notification 
of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
Oct  8 23:31:43 devrp1 lrmd[2347]:  warning: send_client_notify: Notification 
of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed


This is on devrp2
Oct  8 23:31:26 devrp2 kernel: IN=eth0 OUT= 
MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00