Hi all,
When I use the following command to simulate packet loss on the network interface of one member
of my 3-node Pacemaker+Corosync cluster, it sometimes causes Pacemaker on another node to exit.
tc qdisc add dev eth2 root netem loss 90%
Is there any way to avoid this problem?
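In case it matters, after each test I verify and then clear the rule on the node where I injected the loss (eth2 there is presumably the interface Corosync uses):

tc qdisc show dev eth2       # confirm whether the netem rule is still present
tc qdisc del dev eth2 root   # remove the 90% loss rule after the test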
[root@node3 ~]# ps -ef|grep pacemaker
root 32540 1 0 00:57 ? 00:00:00 /usr/libexec/pacemaker/lrmd
189 32542 1 0 00:57 ? 00:00:00 /usr/libexec/pacemaker/pengine
root 33491 11491 0 00:58 pts/1 00:00:00 grep pacemaker
/var/log/cluster/corosync.log
------------------------------------------------
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0,
origin=local/attrd/230, version=10.657.19)
Aug 27 12:33:59 corosync [CPG ] chosen downlist: sender r(0)
ip(192.168.125.129) ; members(old:2 left:1)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2172496064 joined group pacemakerd (counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2172496064 still member of group pacemakerd (peer=node2, counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc:
pcmk_cpg_membership: Node node2[2172496064] - corosync-cpg is now online
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2273159360 still member of group pacemakerd (peer=node3, counter=12.1)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_cs_flush: Sent 0
CPG messages (1 remaining, last=19): Try again (6)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc:
pcmk_cpg_membership: Node node3[2273159360] - corosync-cpg is now offline
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2172496064 still member of group pacemakerd (peer=node2, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_membership:
We're not part of CPG group 'pacemakerd' anymore!
Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_dispatch: Evicted
from CPG membership
Aug 27 12:33:59 [46849] node3 pacemakerd: error: mcp_cpg_destroy:
Connection destroyed
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_xml_cleanup:
Cleaning up memory from libxml2
Aug 27 12:33:59 [46858] node3 attrd: error: crm_ipc_read:
Connection to pacemakerd failed
Aug 27 12:33:59 [46858] node3 attrd: error: mainloop_gio_callback:
Connection to pacemakerd[0x1255eb0] closed (I/O condition=17)
Aug 27 12:33:59 [46858] node3 attrd: crit: attrd_cs_destroy: Lost
connection to Corosync service!
Aug 27 12:33:59 [46858] node3 attrd: notice: main: Exiting...
Aug 27 12:33:59 [46858] node3 attrd: notice: main: Disconnecting
client 0x12579a0, pid=46860...
Aug 27 12:33:59 [46858] node3 attrd: error:
attrd_cib_connection_destroy: Connection to the CIB terminated...
Aug 27 12:33:59 corosync [pcmk ] info: pcmk_ipc_exit: Client attrd
(conn=0x1955f80, async-conn=0x1955f80) left
Aug 27 12:33:59 [46856] node3 stonith-ng: error: crm_ipc_read:
Connection to pacemakerd failed
Aug 27 12:33:59 [46856] node3 stonith-ng: error: mainloop_gio_callback:
Connection to pacemakerd[0x2314af0] closed (I/O condition=17)
Aug 27 12:33:59 [46856] node3 stonith-ng: error: stonith_peer_cs_destroy:
Corosync connection terminated
Aug 27 12:33:59 [46856] node3 stonith-ng: info: stonith_shutdown:
Terminating with 1 clients
Aug 27 12:33:59 [46856] node3 stonith-ng: info: cib_connection_destroy:
Connection to the CIB closed.
...
Please see the attached corosynclog.txt for the full log.
[root@node3 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    secauth: off
    interface {
        member {
            memberaddr: 192.168.125.134
        }
        member {
            memberaddr: 192.168.125.129
        }
        member {
            memberaddr: 192.168.125.135
        }
        ringnumber: 0
        bindnetaddr: 192.168.125.135
        mcastport: 5405
        ttl: 1
    }
    transport: udpu
}
logging {
    fileline: off
    to_logfile: yes
    to_syslog: no
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}
service {
    ver: 1
    name: pacemaker
}
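One thing I have been wondering about, purely as a guess on my part and not something I have verified, is whether raising the totem timeouts would make the membership layer more tolerant of heavy packet loss, e.g. adding something like this to the totem section (illustrative values only):

    token: 5000                                # token timeout in ms (default 1000)
    token_retransmits_before_loss_const: 10    # default 4
    consensus: 6000                            # must be larger than token

I would appreciate advice on whether that is the right direction, or whether the eviction is caused by something else entirely (for example the "Process pause detected" messages that show up in the log below).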
Environment:
[root@node3 ~]# rpm -q corosync
corosync-1.4.1-7.el6.x86_64
[root@node3 ~]# cat /etc/redhat-release
CentOS release 6.3 (Final)
[root@node3 ~]# pacemakerd -F
Pacemaker 1.1.14-1.el6 (Build: 70404b0)
Supporting v3.0.10: generated-manpages agent-manpages ascii-docs ncurses
libqb-logging libqb-ipc nagios corosync-plugin cman acls
Fuller log excerpt from the same timeframe:
------------------------------------------------
Aug 27 12:33:58 corosync [TOTEM ] Process pause detected for 1115 ms, flushing
membership messages.
Aug 27 12:33:59 corosync [TOTEM ] Process pause detected for 652 ms, flushing
membership messages.
Aug 27 12:33:59 [46860] node3 crmd: info: action_synced_wait: Managed
mysqlmha_meta-data_0 process 49106 exited with rc=0
Aug 27 12:33:59 corosync [pcmk ] notice: pcmk_peer_update: Transitional
membership event on ring 25712: memb=1, new=0, lost=0
Aug 27 12:33:59 corosync [pcmk ] info: pcmk_peer_update: memb: node3 2273159360
Aug 27 12:33:59 corosync [pcmk ] notice: pcmk_peer_update: Stable membership
event on ring 25712: memb=2, new=1, lost=0
Aug 27 12:33:59 corosync [pcmk ] info: update_member: Node 2172496064/node2 is
now: member
Aug 27 12:33:59 corosync [pcmk ] info: pcmk_peer_update: NEW: node2 2172496064
Aug 27 12:33:59 corosync [pcmk ] info: pcmk_peer_update: MEMB: node2 2172496064
Aug 27 12:33:59 corosync [pcmk ] info: pcmk_peer_update: MEMB: node3 2273159360
Aug 27 12:33:59 corosync [pcmk ] info: send_member_notification: Sending
membership update 25712 to 3 children
Aug 27 12:33:59 corosync [TOTEM ] A processor joined or left the membership and
a new membership was formed.
Aug 27 12:33:59 [46856] node3 stonith-ng: notice: plugin_handle_membership:
Membership 25712: quorum acquired
Aug 27 12:33:59 [46856] node3 stonith-ng: info: crm_get_peer: Created
entry 11364f1b-043f-483e-af7b-7c3ce0c66274/0x2476a70 for node node2/2172496064
(2 total)
Aug 27 12:33:59 [46856] node3 stonith-ng: info: crm_get_peer: Node
2172496064 is now known as node2
Aug 27 12:33:59 [46855] node3 cib: notice: plugin_handle_membership:
Membership 25712: quorum acquired
Aug 27 12:33:59 [46855] node3 cib: info: crm_get_peer: Created
entry 1ad99213-c2ed-4bc9-a32f-9ec76d95178a/0xb66e70 for node node2/2172496064
(2 total)
Aug 27 12:33:59 [46855] node3 cib: info: crm_get_peer: Node
2172496064 is now known as node2
Aug 27 12:33:59 [46855] node3 cib: info: crm_get_peer: Node
2172496064 has uuid node2
Aug 27 12:33:59 [46855] node3 cib: notice: crm_update_peer_state_iter:
plugin_handle_membership: Node node2[2172496064] - state is now member (was
(null))
Aug 27 12:33:59 [46855] node3 cib: info: crm_update_peer:
plugin_handle_membership: Node node2: id=2172496064 state=member addr=r(0)
ip(192.168.125.129) (new) votes=1 (new) born=24148 seen=25712
proc=00000000000000000000000000000000
Aug 27 12:33:59 [46856] node3 stonith-ng: info: crm_get_peer: Node
2172496064 has uuid node2
Aug 27 12:33:59 [46856] node3 stonith-ng: notice: crm_update_peer_state_iter:
plugin_handle_membership: Node node2[2172496064] - state is now member (was
(null))
Aug 27 12:33:59 [46856] node3 stonith-ng: info: crm_update_peer:
plugin_handle_membership: Node node2: id=2172496064 state=member addr=r(0)
ip(192.168.125.129) (new) votes=1 (new) born=24148 seen=25712
proc=00000000000000000000000000000000
Aug 27 12:33:59 [46860] node3 crmd: notice: plugin_handle_membership:
Membership 25712: quorum acquired
Aug 27 12:33:59 [46860] node3 crmd: notice: crm_update_peer_state_iter:
plugin_handle_membership: Node node2[2172496064] - state is now member (was
lost)
Aug 27 12:33:59 [46860] node3 crmd: info: peer_update_callback:
node2 is now member (was lost)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
--- 10.657.15 2
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
+++ 10.657.16 (null)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib: @num_updates=16
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib/status/node_state[@id='node2']: @crm-debug-origin=peer_update_callback
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0,
origin=local/crmd/266, version=10.657.16)
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_modify operation for section nodes: OK (rc=0,
origin=local/crmd/269, version=10.657.16)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
--- 10.657.16 2
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
+++ 10.657.17 (null)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib: @num_updates=17
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib/status/node_state[@id='node3']: @crm-debug-origin=post_cache_update
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib/status/node_state[@id='node2']: @crm-debug-origin=post_cache_update
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0,
origin=local/crmd/270, version=10.657.17)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
--- 10.657.17 2
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
+++ 10.657.18 (null)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib: @num_updates=18, @dc-uuid=node3
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_modify operation for section cib: OK (rc=0,
origin=local/crmd/271, version=10.657.18)
Aug 27 12:33:59 [46860] node3 crmd: info: crmd_cs_dispatch: Setting
expected votes to 3
Aug 27 12:33:59 [46860] node3 crmd: info: register_fsa_error_adv:
Resetting the current action list
Aug 27 12:33:59 [46860] node3 crmd: warning: crmd_ha_msg_filter: Another
DC detected: node2 (op=noop)
Aug 27 12:33:59 [46860] node3 crmd: info: do_state_transition:
State transition S_FINALIZE_JOIN -> S_ELECTION [ input=I_ELECTION
cause=C_FSA_INTERNAL origin=crmd_ha_msg_filter ]
Aug 27 12:33:59 [46860] node3 crmd: info: update_dc: Unset DC. Was
node3
Aug 27 12:33:59 [46860] node3 crmd: info: do_log: FSA: Input
I_JOIN_RESULT from route_message() received in state S_ELECTION
Aug 27 12:33:59 [46855] node3 cib: info: xml_patch_version_check:
Current num_updates is too high (18 > 15)
Aug 27 12:33:59 [46855] node3 cib: warning: cib_server_process_diff:
Something went wrong in compatibility mode, requesting full refresh
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
--- 10.657.15 2
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
+++ 10.657.16 (null)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib: @num_updates=16
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib/status/node_state[@id='node3']: @crm-debug-origin=peer_update_callback
Aug 27 12:33:59 [46855] node3 cib: info: send_sync_request:
Requesting re-sync from peer
Aug 27 12:33:59 [46855] node3 cib: info: xml_patch_version_check:
Current num_updates is too high (18 > 16)
Aug 27 12:33:59 [46855] node3 cib: warning: cib_server_process_diff:
Something went wrong in compatibility mode, requesting full refresh
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
--- 10.657.16 2
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
+++ 10.657.17 (null)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib: @num_updates=17
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib/status/node_state[@id='node3']: @in_ccm=false,
@crm-debug-origin=post_cache_update
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib/status/node_state[@id='node2']: @crm-debug-origin=post_cache_update
Aug 27 12:33:59 [46855] node3 cib: info: send_sync_request:
Requesting re-sync from peer
Aug 27 12:33:59 [46855] node3 cib: info: xml_patch_version_check:
Current num_updates is too high (18 > 17)
Aug 27 12:33:59 [46855] node3 cib: warning: cib_server_process_diff:
Something went wrong in compatibility mode, requesting full refresh
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
--- 10.657.17 2
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
+++ 10.657.18 (null)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib: @num_updates=18, @have-quorum=0
Aug 27 12:33:59 [46855] node3 cib: info: send_sync_request:
Requesting re-sync from peer
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
--- 10.657.18 2
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: Diff:
+++ 10.657.19 (null)
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib: @num_updates=19, @have-quorum=0, @dc-uuid=node2
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib/status/node_state[@id='node3']: @in_ccm=false,
@crm-debug-origin=do_state_transition, @join=down
Aug 27 12:33:59 [46855] node3 cib: info: cib_perform_op: +
/cib/status/node_state[@id='node2']: @crm-debug-origin=do_state_transition
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_apply_diff operation for section status: OK (rc=0,
origin=node2/crmd/748, version=10.657.19)
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_modify operation for section crm_config: OK (rc=0,
origin=local/crmd/273, version=10.657.19)
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0,
origin=local/attrd/226, version=10.657.19)
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0,
origin=local/attrd/228, version=10.657.19)
Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0,
origin=local/attrd/230, version=10.657.19)
Aug 27 12:33:59 corosync [CPG ] chosen downlist: sender r(0)
ip(192.168.125.129) ; members(old:2 left:1)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2172496064 joined group pacemakerd (counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2172496064 still member of group pacemakerd (peer=node2, counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc:
pcmk_cpg_membership: Node node2[2172496064] - corosync-cpg is now online
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2273159360 still member of group pacemakerd (peer=node3, counter=12.1)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_cs_flush: Sent 0
CPG messages (1 remaining, last=19): Try again (6)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc:
pcmk_cpg_membership: Node node3[2273159360] - corosync-cpg is now offline
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2172496064 still member of group pacemakerd (peer=node2, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_membership:
We're not part of CPG group 'pacemakerd' anymore!
Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_dispatch: Evicted
from CPG membership
Aug 27 12:33:59 [46849] node3 pacemakerd: error: mcp_cpg_destroy:
Connection destroyed
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_xml_cleanup:
Cleaning up memory from libxml2
Aug 27 12:33:59 [46858] node3 attrd: error: crm_ipc_read:
Connection to pacemakerd failed
Aug 27 12:33:59 [46858] node3 attrd: error: mainloop_gio_callback:
Connection to pacemakerd[0x1255eb0] closed (I/O condition=17)
Aug 27 12:33:59 [46858] node3 attrd: crit: attrd_cs_destroy: Lost
connection to Corosync service!
Aug 27 12:33:59 [46858] node3 attrd: notice: main: Exiting...
Aug 27 12:33:59 [46858] node3 attrd: notice: main: Disconnecting
client 0x12579a0, pid=46860...
Aug 27 12:33:59 [46858] node3 attrd: error:
attrd_cib_connection_destroy: Connection to the CIB terminated...
Aug 27 12:33:59 corosync [pcmk ] info: pcmk_ipc_exit: Client attrd
(conn=0x1955f80, async-conn=0x1955f80) left
Aug 27 12:33:59 [46856] node3 stonith-ng: error: crm_ipc_read:
Connection to pacemakerd failed
Aug 27 12:33:59 [46856] node3 stonith-ng: error: mainloop_gio_callback:
Connection to pacemakerd[0x2314af0] closed (I/O condition=17)
Aug 27 12:33:59 [46856] node3 stonith-ng: error: stonith_peer_cs_destroy:
Corosync connection terminated
Aug 27 12:33:59 [46856] node3 stonith-ng: info: stonith_shutdown:
Terminating with 1 clients
Aug 27 12:33:59 [46856] node3 stonith-ng: info: cib_connection_destroy:
Connection to the CIB closed.
Aug 27 12:33:59 [46856] node3 stonith-ng: info: qb_ipcs_us_withdraw:
withdrawing server sockets
Aug 27 12:33:59 [46856] node3 stonith-ng: info: main: Done
Aug 27 12:33:59 [46856] node3 stonith-ng: info: crm_xml_cleanup:
Cleaning up memory from libxml2
Aug 27 12:34:00 corosync [MAIN ] Completed service synchronization, ready to
provide service.
Aug 27 12:34:00 corosync [pcmk ] info: pcmk_ipc_exit: Client stonith-ng
(conn=0x195aa50, async-conn=0x195aa50) left