Re: [Pacemaker] help deciphering output
I have seen this behaviour on several virtualised environments: when the VM backup starts, the VM actually freezes for a (short?) period of time. I guess it then stops responding to the other cluster nodes, triggering unexpected failover and/or fencing. I see this kind of behaviour on VMware environments using Veeam backup, as well as on Proxmox (I don't know which backup tool it uses).

That's actually an interesting topic I never thought about raising here. How can we avoid it? By increasing timeouts? I am afraid we would have to reach unacceptably high timeout values, and I am not even sure that would fix the problem. I also suspect that not every VM snapshot strategy triggers this problem, so do you have any feedback on which backup/snapshot method best suits corosync clusters?

Regards

On 9 Oct 2014 01:24, Alex Samad - Yieldbroker alex.sa...@yieldbroker.com wrote:

One of my nodes died in a 2-node cluster. I gather something went wrong and it fenced/killed itself, but I am not sure what happened. I think the VM backups may have run around that time, and a snapshot of the VM could have been taken, but there is nothing for me to put my finger on.

[quoted log output snipped; it appears in full in the original message below]
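For what it's worth, on a CMAN-based cluster like the one in this thread's logs, the totem token timeout is raised in /etc/cluster/cluster.conf (on corosync 2.x it would be the token setting in the totem{} section of corosync.conf instead). A minimal sketch of the change, where the cluster name, config_version, and the 30000 ms value are purely illustrative and not taken from this thread:

```xml
<!-- /etc/cluster/cluster.conf fragment; names and values are illustrative.
     config_version must be bumped and the file propagated to every node. -->
<cluster name="devrp" config_version="2">
  <!-- token: how long (in ms) corosync waits for a node before declaring
       it failed. The CMAN default is 10000 ms, so a VM frozen by a
       snapshot for longer than ~10s will be evicted and fenced. -->
  <totem token="30000"/>
</cluster>
```

Whether any value short of "unacceptably high" covers a backup freeze is exactly the open question above; the fragment only shows where the knob lives.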
Re: [Pacemaker] help deciphering output
On 9 Oct 2014, at 5:06 pm, Alexandre alxg...@gmail.com wrote:

I have seen this behaviour on several virtualised environments: when the VM backup starts, the VM actually freezes for a (short?) period of time. I guess it then stops responding to the other cluster nodes, triggering unexpected failover and/or fencing.

Alas, the dlm is _really_ intolerant of any membership blips. Once a node is marked failed, the dlm wants it fenced, even if it comes back 1 ms later.

[remainder of the quoted message, including the log output, snipped]
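The timestamps in the quoted log illustrate the point about blips: the gap between "A processor failed" and "quorum regained" is only a few seconds, yet the dlm had already decided node 2 was gone. A quick sketch for measuring such a gap from syslog lines (plain Python; syslog omits the year, so it is pinned to 2014 here, when these logs were written):

```python
from datetime import datetime

def syslog_time(line):
    # Take the leading "Oct 8 23:31:38" syslog timestamp and parse it,
    # pinning the year to 2014 since syslog does not record one.
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(stamp + " 2014", "%b %d %H:%M:%S %Y")

failed = syslog_time(
    "Oct 8 23:31:38 devrp1 corosync[1670]: [TOTEM ] A processor failed, forming new configuration.")
regained = syslog_time(
    "Oct 8 23:31:42 devrp1 corosync[1670]: [CMAN ] quorum regained, resuming activity")
print((regained - failed).total_seconds())  # 4.0 -- the node was back within seconds
```

Four seconds of silence was enough to lose quorum, close the dlm connection, and set up the kill from node 2.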
[Pacemaker] help deciphering output
One of my nodes died in a 2-node cluster. I gather something went wrong and it fenced/killed itself, but I am not sure what happened. I think the VM backups may have run around that time, and a snapshot of the VM could have been taken, but there is nothing for me to put my finger on.

Output from the messages log around that time. This is on devrp1:

Oct 8 23:31:38 devrp1 corosync[1670]: [TOTEM ] A processor failed, forming new configuration.
Oct 8 23:31:40 devrp1 corosync[1670]: [CMAN ] quorum lost, blocking activity
Oct 8 23:31:40 devrp1 corosync[1670]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
Oct 8 23:31:40 devrp1 corosync[1670]: [QUORUM] Members[1]: 1
Oct 8 23:31:40 devrp1 corosync[1670]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 8 23:31:40 devrp1 corosync[1670]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:2 left:1)
Oct 8 23:31:40 devrp1 corosync[1670]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 8 23:31:41 devrp1 kernel: dlm: closing connection to node 2
Oct 8 23:31:42 devrp1 crmd[2350]: notice: cman_event_callback: Membership 424: quorum lost
Oct 8 23:31:42 devrp1 corosync[1670]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Oct 8 23:31:42 devrp1 corosync[1670]: [CMAN ] quorum regained, resuming activity
Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] This node is within the primary component and will provide service.
Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] Members[2]: 1 2
Oct 8 23:31:42 devrp1 corosync[1670]: [QUORUM] Members[2]: 1 2
Oct 8 23:31:42 devrp1 corosync[1670]: [CPG ] chosen downlist: sender r(0) ip(10.172.214.51) ; members(old:1 left:0)
Oct 8 23:31:42 devrp1 corosync[1670]: [MAIN ] Completed service synchronization, ready to provide service.
Oct 8 23:31:42 devrp1 crmd[2350]: notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now lost (was member)
Oct 8 23:31:42 devrp1 crmd[2350]: warning: reap_dead_nodes: Our DC node (devrp2) left the cluster
Oct 8 23:31:42 devrp1 crmd[2350]: notice: cman_event_callback: Membership 428: quorum acquired
Oct 8 23:31:42 devrp1 crmd[2350]: notice: crm_update_peer_state: cman_event_callback: Node devrp2[2] - state is now member (was lost)
Oct 8 23:31:42 devrp1 crmd[2350]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=reap_dead_nodes ]
Oct 8 23:31:42 devrp1 corosync[1670]: cman killed by node 2 because we were killed by cman_tool or other application
Oct 8 23:31:42 devrp1 pacemakerd[2339]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:42 devrp1 stonith-ng[2346]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:42 devrp1 crmd[2350]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:42 devrp1 crmd[2350]: error: crmd_cs_destroy: connection terminated
Oct 8 23:31:43 devrp1 fenced[1726]: cluster is down, exiting
Oct 8 23:31:43 devrp1 fenced[1726]: daemon cpg_dispatch error 2
Oct 8 23:31:43 devrp1 attrd[2348]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:43 devrp1 attrd[2348]: crit: attrd_cs_destroy: Lost connection to Corosync service!
Oct 8 23:31:43 devrp1 attrd[2348]: notice: main: Exiting...
Oct 8 23:31:43 devrp1 attrd[2348]: notice: main: Disconnecting client 0x18cf240, pid=2350...
Oct 8 23:31:43 devrp1 pacemakerd[2339]: error: mcp_cpg_destroy: Connection destroyed
Oct 8 23:31:43 devrp1 cib[2345]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Oct 8 23:31:43 devrp1 cib[2345]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Oct 8 23:31:43 devrp1 stonith-ng[2346]: error: stonith_peer_cs_destroy: Corosync connection terminated
Oct 8 23:31:43 devrp1 dlm_controld[1752]: cluster is down, exiting
Oct 8 23:31:43 devrp1 dlm_controld[1752]: daemon cpg_dispatch error 2
Oct 8 23:31:43 devrp1 gfs_controld[1801]: cluster is down, exiting
Oct 8 23:31:43 devrp1 crmd[2350]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Oct 8 23:31:43 devrp1 attrd[2348]: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Oct 8 23:31:43 devrp1 lrmd[2347]: warning: qb_ipcs_event_sendv: new_event_notification (2347-2350-6): Bad file descriptor (9)
Oct 8 23:31:43 devrp1 lrmd[2347]: warning: send_client_notify: Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed
Oct 8 23:31:43 devrp1 lrmd[2347]: warning: send_client_notify: Notification of client crmd/94e94935-2221-434d-8a6f-90eba4ede55b failed

Devrp2:

Oct 8 23:31:26 devrp2 kernel: IN=eth0 OUT= MAC=00:50:56:a6:3a:5d:00:00:00:00:00:00:08:00
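When deciphering output like the above, it helps to strip the log down to just the TOTEM/QUORUM/CMAN membership transitions before worrying about the cascade of daemon-exit noise that follows them. A small sketch (the tag filter only covers the subsystems seen in this thread, not every corosync logger):

```python
import re

# Match only the corosync subsystem tags that record membership/quorum
# changes in these logs: [TOTEM ], [QUORUM], [CMAN ]. Tags such as
# [MAIN ] and [CPG ] are deliberately left out as noise here.
TAG_RE = re.compile(r"\[(TOTEM|QUORUM|CMAN)\s*\]")

def membership_events(lines):
    """Return only the lines that record a membership or quorum change."""
    return [ln for ln in lines if TAG_RE.search(ln)]

log = [
    "Oct 8 23:31:38 devrp1 corosync[1670]: [TOTEM ] A processor failed, forming new configuration.",
    "Oct 8 23:31:40 devrp1 corosync[1670]: [CMAN ] quorum lost, blocking activity",
    "Oct 8 23:31:40 devrp1 corosync[1670]: [MAIN ] Completed service synchronization, ready to provide service.",
    "Oct 8 23:31:42 devrp1 corosync[1670]: [CMAN ] quorum regained, resuming activity",
]
for ln in membership_events(log):
    print(ln)
```

Run against the full devrp1 log above, this reduces the story to: processor failed, quorum lost, membership reformed, quorum regained; everything after that is the consequence of node 2's kill.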