Re: [ClusterLabs] data loss of network would cause Pacemaker exit abnormally

2016-08-31 Thread Ken Gaillot
On 08/30/2016 01:58 PM, chenhj wrote:
> Hi,
> 
> This is a continuation of the email below (I am not subscribed to this mailing list):
> 
> http://clusterlabs.org/pipermail/users/2016-August/003838.html
> 
>> From the above, I suspect that the node with the network loss was the
>> DC, and from its point of view, it was the other node that went away.
> 
> Yes, the node with the network loss was the DC (node2).
> 
> Could someone explain what the following messages mean, and
> why the pacemakerd process exits instead of rejoining the CPG group?
> 
>> Aug 27 12:33:59 [46849] node3 pacemakerd:error: pcmk_cpg_membership:
>>We're not part of CPG group 'pacemakerd' anymore!

This means the node was kicked out of the membership. I don't remember
exactly what that implies; I'm guessing the node exits because the cluster
will most likely fence it after kicking it out.
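
If you want a quick look at how each side sees the membership while you
reproduce this, the standard tools are enough (just a sanity check, nothing
specific to this issue):

  # node states as Pacemaker currently sees them
  crm_mon -1

  # ring status of the local corosync instance
  corosync-cfgtool -s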

> 
>>> [root@node3 ~]# rpm -q corosync
>>> corosync-1.4.1-7.el6.x86_64
>>That is quite old ...
>>> [root@node3 ~]# cat /etc/redhat-release
>>> CentOS release 6.3 (Final)
>>> [root@node3 ~]# pacemakerd -F
>> Pacemaker 1.1.14-1.el6 (Build: 70404b0)
>>and I doubt that many people have tested Pacemaker 1.1.14 against
>>corosync 1.4.1 ... quite far away from
>>each other release-wise ...
> 
> Pacemaker 1.1.14 + corosync 1.4.7 can also reproduce this problem, but
> seemingly with lower probability.

The corosync 2 series is a major improvement, but some configuration changes
are necessary.
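
Off the top of my head (untested -- double-check against the example
corosync.conf shipped with the corosync 2 packages), the old member/memberaddr
entries move into a nodelist section and quorum is provided by votequorum,
roughly:

  totem {
      version: 2
      transport: udpu
      interface {
          ringnumber: 0
          bindnetaddr: 192.168.125.0
          mcastport: 5405
          ttl: 1
      }
  }

  nodelist {
      node {
          ring0_addr: 192.168.125.134
          nodeid: 1
      }
      node {
          ring0_addr: 192.168.125.129
          nodeid: 2
      }
      node {
          ring0_addr: 192.168.125.135
          nodeid: 3
      }
  }

  quorum {
      provider: corosync_votequorum
  }

Also drop the service { ver: 1 name: pacemaker } block entirely; the pacemaker
plugin is gone in corosync 2, and pacemaker is started on its own (init script
or systemd unit) instead.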


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] data loss of network would cause Pacemaker exit abnormally

2016-08-31 Thread chenhj
Hi,


This is a continuation of the email below (I am not subscribed to this mailing list):


http://clusterlabs.org/pipermail/users/2016-August/003838.html


> From the above, I suspect that the node with the network loss was the
> DC, and from its point of view, it was the other node that went away.


Yes, the node with the network loss was the DC (node2).


Could someone explain what the following messages mean, and
why the pacemakerd process exits instead of rejoining the CPG group?


> Aug 27 12:33:59 [46849] node3 pacemakerd:error: pcmk_cpg_membership:
>We're not part of CPG group 'pacemakerd' anymore!




>> [root@node3 ~]# rpm -q corosync
>> corosync-1.4.1-7.el6.x86_64
>That is quite old ...
>> [root@node3 ~]# cat /etc/redhat-release
>> CentOS release 6.3 (Final)
>> [root@node3 ~]# pacemakerd -F
> Pacemaker 1.1.14-1.el6 (Build: 70404b0)
>and I doubt that many people have tested Pacemaker 1.1.14 against
>corosync 1.4.1 ... quite far away from
>each other release-wise ...


Pacemaker 1.1.14 + corosync 1.4.7 can also reproduce this problem, but seemingly
with lower probability.
___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] data loss of network would cause Pacemaker exit abnormally

2016-08-29 Thread Ken Gaillot
On 08/27/2016 09:15 PM, chenhj wrote:
> Hi all,
> 
> When I use the following command to simulate network data loss on one
> member of my 3-node Pacemaker+Corosync cluster,
> it sometimes causes Pacemaker on another node to exit.
> 
>   tc qdisc add dev eth2 root netem loss 90%
> 
> Is there any method to avoid this problem?
> 
> [root@node3 ~]# ps -ef|grep pacemaker
> root  32540      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/lrmd
> 189   32542      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/pengine
> root  33491  11491  0 00:58 pts/1    00:00:00 grep pacemaker
> 
> /var/log/cluster/corosync.log 
> 
> Aug 27 12:33:59 [46855] node3cib: info: cib_process_request:
>Completed cib_modify operation for section status: OK (rc=0,
> origin=local/attrd/230, version=10.657.19)
> Aug 27 12:33:59 corosync [CPG   ] chosen downlist: sender r(0)
> ip(192.168.125.129) ; members(old:2 left:1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
>Node 2172496064 joined group pacemakerd (counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
>Node 2172496064 still member of group pacemakerd (peer=node2,
> counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> crm_update_peer_proc:   pcmk_cpg_membership: Node node2[2172496064]
> - corosync-cpg is now online
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
>Node 2273159360 still member of group pacemakerd (peer=node3,
> counter=12.1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_cs_flush:  
> Sent 0 CPG messages  (1 remaining, last=19): Try again (6)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
>Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> crm_update_peer_proc:   pcmk_cpg_membership: Node node3[2273159360]
> - corosync-cpg is now offline
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
>Node 2172496064 still member of group pacemakerd (peer=node2,
> counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd:error: pcmk_cpg_membership:
>We're not part of CPG group 'pacemakerd' anymore!
> Aug 27 12:33:59 [46849] node3 pacemakerd:error: pcmk_cpg_dispatch:
>  Evicted from CPG membership

From the above, I suspect that the node with the network loss was the
DC, and from its point of view, it was the other node that went away.

Proper quorum and fencing configuration should prevent this from being
an issue. Once the one node sees heavy network loss, the other node(s)
should fence it before it causes too many problems.
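
As a rough, untested sketch only -- the fence agent and every option below are
placeholders, so substitute whatever matches your hardware (IPMI, iLO, a power
switch, ...); pcs is shown, but crmsh works just as well:

  # enforce fencing and a sane quorum policy
  pcs property set stonith-enabled=true
  pcs property set no-quorum-policy=stop

  # placeholder fence device; agent, address and credentials are examples only
  pcs stonith create fence-node2 fence_ipmilan \
      pcmk_host_list=node2 ipaddr=192.0.2.12 login=admin passwd=secret \
      op monitor interval=60s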

> Aug 27 12:33:59 [46849] node3 pacemakerd:error: mcp_cpg_destroy:  
>  Connection destroyed
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_xml_cleanup:  
>  Cleaning up memory from libxml2
> Aug 27 12:33:59 [46858] node3  attrd:error: crm_ipc_read:  
> Connection to pacemakerd failed
> Aug 27 12:33:59 [46858] node3  attrd:error:
> mainloop_gio_callback:  Connection to pacemakerd[0x1255eb0] closed
> (I/O condition=17)
> Aug 27 12:33:59 [46858] node3  attrd: crit: attrd_cs_destroy:  
> Lost connection to Corosync service!
> Aug 27 12:33:59 [46858] node3  attrd:   notice: main:   Exiting...
> Aug 27 12:33:59 [46858] node3  attrd:   notice: main:  
> Disconnecting client 0x12579a0, pid=46860...
> Aug 27 12:33:59 [46858] node3  attrd:error:
> attrd_cib_connection_destroy:   Connection to the CIB terminated...
> Aug 27 12:33:59 corosync [pcmk  ] info: pcmk_ipc_exit: Client attrd
> (conn=0x1955f80, async-conn=0x1955f80) left
> Aug 27 12:33:59 [46856] node3 stonith-ng:error: crm_ipc_read:  
> Connection to pacemakerd failed
> Aug 27 12:33:59 [46856] node3 stonith-ng:error:
> mainloop_gio_callback:  Connection to pacemakerd[0x2314af0] closed
> (I/O condition=17)
> Aug 27 12:33:59 [46856] node3 stonith-ng:error:
> stonith_peer_cs_destroy:Corosync connection terminated
> Aug 27 12:33:59 [46856] node3 stonith-ng: info: stonith_shutdown:  
> Terminating with  1 clients
> Aug 27 12:33:59 [46856] node3 stonith-ng: info:
> cib_connection_destroy: Connection to the CIB closed.
> ...
> 
> Please see corosynclog.txt for the detailed log.
> 
> 
> [root@node3 ~]# cat /etc/corosync/corosync.conf
> totem {
>version: 2
>secauth: off
>interface {
>member {
>memberaddr: 192.168.125.134
>}
>member {
>memberaddr: 192.168.125.129
>}
>member {
>memberaddr: 192.168.125.135
>}
> 
>ringnumber: 0
>bindnetaddr: 192.168.125.135
>mcastport: 5405
>

Re: [ClusterLabs] data loss of network would cause Pacemaker exit abnormally

2016-08-29 Thread Klaus Wenninger
On 08/28/2016 04:15 AM, chenhj wrote:
> Hi all,
>
> When I use the following command to simulate network data loss on
> one member of my 3-node Pacemaker+Corosync cluster,
> it sometimes causes Pacemaker on another node to exit.
>
>   tc qdisc add dev eth2 root netem loss 90%
>
> Is there any method to avoid this problem?
>
> [root@node3 ~]# ps -ef|grep pacemaker
> root  32540      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/lrmd
> 189   32542      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/pengine
> root  33491  11491  0 00:58 pts/1    00:00:00 grep pacemaker
>
> /var/log/cluster/corosync.log 
> 
> Aug 27 12:33:59 [46855] node3cib: info:
> cib_process_request:Completed cib_modify operation for section
> status: OK (rc=0, origin=local/attrd/230, version=10.657.19)
> Aug 27 12:33:59 corosync [CPG   ] chosen downlist: sender r(0)
> ip(192.168.125.129) ; members(old:2 left:1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership:Node 2172496064 joined group pacemakerd
> (counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership:Node 2172496064 still member of group
> pacemakerd (peer=node2, counter=12.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> crm_update_peer_proc:   pcmk_cpg_membership: Node
> node2[2172496064] - corosync-cpg is now online
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership:Node 2273159360 still member of group
> pacemakerd (peer=node3, counter=12.1)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_cs_flush:
>   Sent 0 CPG messages  (1 remaining, last=19): Try again (6)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership:Node 2273159360 left group pacemakerd
> (peer=node3, counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> crm_update_peer_proc:   pcmk_cpg_membership: Node
> node3[2273159360] - corosync-cpg is now offline
> Aug 27 12:33:59 [46849] node3 pacemakerd: info:
> pcmk_cpg_membership:Node 2172496064 still member of group
> pacemakerd (peer=node2, counter=13.0)
> Aug 27 12:33:59 [46849] node3 pacemakerd:error:
> pcmk_cpg_membership:We're not part of CPG group 'pacemakerd'
> anymore!
> Aug 27 12:33:59 [46849] node3 pacemakerd:error: pcmk_cpg_dispatch:
>  Evicted from CPG membership
> Aug 27 12:33:59 [46849] node3 pacemakerd:error: mcp_cpg_destroy:  
>  Connection destroyed
> Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_xml_cleanup:  
>  Cleaning up memory from libxml2
> Aug 27 12:33:59 [46858] node3  attrd:error: crm_ipc_read:
>   Connection to pacemakerd failed
> Aug 27 12:33:59 [46858] node3  attrd:error:
> mainloop_gio_callback:  Connection to pacemakerd[0x1255eb0] closed
> (I/O condition=17)
> Aug 27 12:33:59 [46858] node3  attrd: crit: attrd_cs_destroy:
>   Lost connection to Corosync service!
> Aug 27 12:33:59 [46858] node3  attrd:   notice: main:   Exiting...
> Aug 27 12:33:59 [46858] node3  attrd:   notice: main:  
> Disconnecting client 0x12579a0, pid=46860...
> Aug 27 12:33:59 [46858] node3  attrd:error:
> attrd_cib_connection_destroy:   Connection to the CIB terminated...
> Aug 27 12:33:59 corosync [pcmk  ] info: pcmk_ipc_exit: Client attrd
> (conn=0x1955f80, async-conn=0x1955f80) left
> Aug 27 12:33:59 [46856] node3 stonith-ng:error: crm_ipc_read:
>   Connection to pacemakerd failed
> Aug 27 12:33:59 [46856] node3 stonith-ng:error:
> mainloop_gio_callback:  Connection to pacemakerd[0x2314af0] closed
> (I/O condition=17)
> Aug 27 12:33:59 [46856] node3 stonith-ng:error:
> stonith_peer_cs_destroy:Corosync connection terminated
> Aug 27 12:33:59 [46856] node3 stonith-ng: info: stonith_shutdown:
>   Terminating with  1 clients
> Aug 27 12:33:59 [46856] node3 stonith-ng: info:
> cib_connection_destroy: Connection to the CIB closed.
> ...
>
> Please see corosynclog.txt for the detailed log.
>
>
> [root@node3 ~]# cat /etc/corosync/corosync.conf
> totem {
>version: 2
>secauth: off
>interface {
>member {
>memberaddr: 192.168.125.134
>}
>member {
>memberaddr: 192.168.125.129
>}
>member {
>memberaddr: 192.168.125.135
>}
>
>ringnumber: 0
>bindnetaddr: 192.168.125.135
>mcastport: 5405
>ttl: 1
>}
>transport: udpu
> }
>
> logging {
>fileline: off
>to_logfile: yes
>to_syslog: no
>logfile: /var/log/cluster/corosync.log
>debug: off
>timestamp: on
>logger_subsys {
>subsys: AMF
>debug: off
>}
> }
>
> service {
>  

[ClusterLabs] data loss of network would cause Pacemaker exit abnormally

2016-08-28 Thread chenhj
Hi all,


When I use the following command to simulate network data loss on one member
of my 3-node Pacemaker+Corosync cluster,
it sometimes causes Pacemaker on another node to exit.


  tc qdisc add dev eth2 root netem loss 90%


Is there any method to avoid this problem?


[root@node3 ~]# ps -ef|grep pacemaker
root  32540      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/lrmd
189   32542      1  0 00:57 ?        00:00:00 /usr/libexec/pacemaker/pengine
root  33491  11491  0 00:58 pts/1    00:00:00 grep pacemaker


/var/log/cluster/corosync.log 

Aug 27 12:33:59 [46855] node3cib: info: cib_process_request:
Completed cib_modify operation for section status: OK (rc=0, 
origin=local/attrd/230, version=10.657.19)
Aug 27 12:33:59 corosync [CPG   ] chosen downlist: sender r(0) 
ip(192.168.125.129) ; members(old:2 left:1)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2172496064 joined group pacemakerd (counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2172496064 still member of group pacemakerd (peer=node2, counter=12.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node node2[2172496064] - corosync-cpg is now online
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2273159360 still member of group pacemakerd (peer=node3, counter=12.1)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_cs_flush:   Sent 0 
CPG messages  (1 remaining, last=19): Try again (6)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2273159360 left group pacemakerd (peer=node3, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_update_peer_proc:   
pcmk_cpg_membership: Node node3[2273159360] - corosync-cpg is now offline
Aug 27 12:33:59 [46849] node3 pacemakerd: info: pcmk_cpg_membership:
Node 2172496064 still member of group pacemakerd (peer=node2, counter=13.0)
Aug 27 12:33:59 [46849] node3 pacemakerd:error: pcmk_cpg_membership:
We're not part of CPG group 'pacemakerd' anymore!
Aug 27 12:33:59 [46849] node3 pacemakerd:error: pcmk_cpg_dispatch:  Evicted 
from CPG membership
Aug 27 12:33:59 [46849] node3 pacemakerd:error: mcp_cpg_destroy:
Connection destroyed
Aug 27 12:33:59 [46849] node3 pacemakerd: info: crm_xml_cleanup:
Cleaning up memory from libxml2
Aug 27 12:33:59 [46858] node3  attrd:error: crm_ipc_read:   
Connection to pacemakerd failed
Aug 27 12:33:59 [46858] node3  attrd:error: mainloop_gio_callback:  
Connection to pacemakerd[0x1255eb0] closed (I/O condition=17)
Aug 27 12:33:59 [46858] node3  attrd: crit: attrd_cs_destroy:   Lost 
connection to Corosync service!
Aug 27 12:33:59 [46858] node3  attrd:   notice: main:   Exiting...
Aug 27 12:33:59 [46858] node3  attrd:   notice: main:   Disconnecting 
client 0x12579a0, pid=46860...
Aug 27 12:33:59 [46858] node3  attrd:error: 
attrd_cib_connection_destroy:   Connection to the CIB terminated...
Aug 27 12:33:59 corosync [pcmk  ] info: pcmk_ipc_exit: Client attrd 
(conn=0x1955f80, async-conn=0x1955f80) left
Aug 27 12:33:59 [46856] node3 stonith-ng:error: crm_ipc_read:   
Connection to pacemakerd failed
Aug 27 12:33:59 [46856] node3 stonith-ng:error: mainloop_gio_callback:  
Connection to pacemakerd[0x2314af0] closed (I/O condition=17)
Aug 27 12:33:59 [46856] node3 stonith-ng:error: stonith_peer_cs_destroy:
Corosync connection terminated
Aug 27 12:33:59 [46856] node3 stonith-ng: info: stonith_shutdown:   
Terminating with  1 clients
Aug 27 12:33:59 [46856] node3 stonith-ng: info: cib_connection_destroy: 
Connection to the CIB closed.
...


Please see corosynclog.txt for the detailed log.




[root@node3 ~]# cat /etc/corosync/corosync.conf
totem {
   version: 2
   secauth: off
   interface {
   member {
   memberaddr: 192.168.125.134
   }
   member {
   memberaddr: 192.168.125.129
   }
   member {
   memberaddr: 192.168.125.135
   }


   ringnumber: 0
   bindnetaddr: 192.168.125.135
   mcastport: 5405
   ttl: 1
   }
   transport: udpu
}


logging {
   fileline: off
   to_logfile: yes
   to_syslog: no
   logfile: /var/log/cluster/corosync.log
   debug: off
   timestamp: on
   logger_subsys {
   subsys: AMF
   debug: off
   }
}


service {
   ver: 1
   name: pacemaker
}


Environment:
[root@node3 ~]# rpm -q corosync
corosync-1.4.1-7.el6.x86_64
[root@node3 ~]# cat /etc/redhat-release 
CentOS release 6.3 (Final)
[root@node3 ~]# pacemakerd -F
Pacemaker 1.1.14-1.el6 (Build: 70404b0)