On 08/28/2016 04:15 AM, chenhj wrote:
> Hi all,
>
> When I use the following command to simulate network data loss at one
> member of my 3-node Pacemaker+Corosync cluster, it sometimes causes
> Pacemaker on another node to exit.
>
> tc qdisc add dev eth2 root netem loss 90%
>
> Is there any way to avoid this problem?
>
> [root@node3 ~]# ps -ef|grep pacemaker
> root      32540      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/lrmd
> 189       32542      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/pengine
> root      33491  11491  0 00:58 pts/1    00:00:00 grep pacemaker
>
> /var/log/cluster/corosync.log
> ------------------------------------------------
> Aug 27 12:33:59 [46855] node3 cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=local/attrd/230, version=10.657.19)
> Aug 27 12:33:59 corosync [CPG ] chosen downlist: sender r(0) ip(192.168.125.129) ; members(old:2 left:1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2172496064 joined group pacemakerd (counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2172496064 still member of group pacemakerd (peer=node2, counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc: pcmk_cpg_membership: Node node2[2172496064] - corosync-cpg is now online
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2273159360 still member of group pacemakerd (peer=node3, counter=12.1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_cs_flush: Sent 0 CPG messages (1 remaining, last=19): Try again (6)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc: pcmk_cpg_membership: Node node3[2273159360] - corosync-cpg is now offline
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership: Node 2172496064 still member of group pacemakerd (peer=node2, counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_membership: We're not part of CPG group 'pacemakerd' anymore!
> Aug 27 12:33:59 [46849] node3 pacemakerd: error: pcmk_cpg_dispatch: Evicted from CPG membership
> Aug 27 12:33:59 [46849] node3 pacemakerd: error: mcp_cpg_destroy: Connection destroyed
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Aug 27 12:33:59 [46858] node3 attrd: error: crm_ipc_read: Connection to pacemakerd failed
> Aug 27 12:33:59 [46858] node3 attrd: error: mainloop_gio_callback: Connection to pacemakerd[0x1255eb0] closed (I/O condition=17)
> Aug 27 12:33:59 [46858] node3 attrd: crit: attrd_cs_destroy: Lost connection to Corosync service!
> Aug 27 12:33:59 [46858] node3 attrd: notice: main: Exiting...
> Aug 27 12:33:59 [46858] node3 attrd: notice: main: Disconnecting client 0x12579a0, pid=46860...
> Aug 27 12:33:59 [46858] node3 attrd: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
> Aug 27 12:33:59 corosync [pcmk ] info: pcmk_ipc_exit: Client attrd (conn=0x1955f80, async-conn=0x1955f80) left
> Aug 27 12:33:59 [46856] node3 stonith-ng: error: crm_ipc_read: Connection to pacemakerd failed
> Aug 27 12:33:59 [46856] node3 stonith-ng: error: mainloop_gio_callback: Connection to pacemakerd[0x2314af0] closed (I/O condition=17)
> Aug 27 12:33:59 [46856] node3 stonith-ng: error: stonith_peer_cs_destroy: Corosync connection terminated
> Aug 27 12:33:59 [46856] node3 stonith-ng: info: stonith_shutdown: Terminating with 1 clients
> Aug 27 12:33:59 [46856] node3 stonith-ng: info: cib_connection_destroy: Connection to the CIB closed.
> ...
>
> Please see corosynclog.txt for the full log.
>
> [root@node3 ~]# cat /etc/corosync/corosync.conf
> totem {
>         version: 2
>         secauth: off
>         interface {
>                 member {
>                         memberaddr: 192.168.125.134
>                 }
>                 member {
>                         memberaddr: 192.168.125.129
>                 }
>                 member {
>                         memberaddr: 192.168.125.135
>                 }
>
>                 ringnumber: 0
>                 bindnetaddr: 192.168.125.135
>                 mcastport: 5405
>                 ttl: 1
>         }
>         transport: udpu
> }
>
> logging {
>         fileline: off
>         to_logfile: yes
>         to_syslog: no
>         logfile: /var/log/cluster/corosync.log
>         debug: off
>         timestamp: on
>         logger_subsys {
>                 subsys: AMF
>                 debug: off
>         }
> }
>
> service {
>         ver: 1
>         name: pacemaker
> }
>
> Environment:
> [root@node3 ~]# rpm -q corosync
> corosync-1.4.1-7.el6.x86_64

That is quite old ...

> [root@node3 ~]# cat /etc/redhat-release
> CentOS release 6.3 (Final)
> [root@node3 ~]# pacemakerd -F
> Pacemaker 1.1.14-1.el6 (Build: 70404b0)

and I doubt that many people have tested Pacemaker 1.1.14 against corosync 1.4.1 ... they are quite far apart release-wise ...

> Supporting v3.0.10: generated-manpages agent-manpages ascii-docs ncurses libqb-logging libqb-ipc nagios corosync-plugin cman acls
>
>
> _______________________________________________
> Users mailing list: Users@clusterlabs.org
> http://clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
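One general note on the fault-injection itself: with netem it is easy to leave the loss rule attached after the experiment, so the node keeps running on a degraded link without you noticing. A minimal sketch of checking and cleaning up the rule, using the same eth2 interface as in your command:

# show whatever qdisc is currently attached to eth2
tc qdisc show dev eth2

# inject 90% packet loss for the test (your original command)
tc qdisc add dev eth2 root netem loss 90%

# ... run the test ...

# remove the netem qdisc again so the interface goes back to normal
tc qdisc del dev eth2 root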
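And if you retest (ideally on newer, matching packages), it helps to capture the membership view and the exact versions from each node while the loss is injected. A rough sketch with the standard CLI tools; what is available depends on your corosync/pacemaker builds:

# ring status as corosync on this node sees it
corosync-cfgtool -s

# one-shot cluster status from Pacemaker's point of view
crm_mon -1

# exact package versions for the report
rpm -q corosync pacemaker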
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org