Re: [Pacemaker] Network outage debugging

Andrew Beekhof Tue, 12 Nov 2013 15:06:24 -0800

On 13 Nov 2013, at 6:10 am, Sean Lutner <s...@rentul.net> wrote:

> The folks testing the cluster I've been building have run a script which 
> blocks all traffic except SSH on one node of the cluster for 15 seconds to 
> mimic a network failure. During this time, the network being "down" seems to 
> cause some odd behavior from pacemaker resulting in it dying.
> 
> The cluster is two nodes and running four custom resources on EC2 instances. 
> The OS is CentOS 6.4 with the config below:
> 
> I've attached the /var/log/messages and /var/log/cluster/corosync.log from 
> the time period during the test. I'm having some difficulty in piecing 
> together what happened and am hoping someone can shed some light on the 
> problem. Any indications why pacemaker is dying on that node?


Because corosync is dying underneath it:

Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: send_ais_text:       Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: pcmk_cpg_dispatch:   Connection to the CPG API failed: 2
Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: cib_ais_destroy:     Corosync connection lost!  Exiting.
Nov 09 14:51:49 [942] ip-10-50-3-251        cib:     info: terminate_cib:       cib_ais_destroy: Exiting fast...
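
If it helps to pull entries like these out of a longer corosync.log, here is a
minimal sketch in Python. It assumes each log entry sits on a single line (not
wrapped the way mail clients wrap them) and follows the layout seen in the
excerpt above. The names LOG_LINE, parse_line and first_errors are purely
illustrative; they are not part of Pacemaker or corosync.

```python
import re

# Assumed layout, taken from the excerpt in this thread:
#   "Nov 09 14:51:49 [942] ip-10-50-3-251   cib:   error: send_ais_text:   message"
LOG_LINE = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<pid>\d+)\] "
    r"(?P<host>\S+)\s+"
    r"(?P<daemon>\S+):\s+"
    r"(?P<level>\w+):\s+"
    r"(?P<func>\S+):\s+"
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def first_errors(lines):
    """Return the first error-level entry seen for each daemon, keyed by daemon name."""
    seen = {}
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] == "error" and entry["daemon"] not in seen:
            seen[entry["daemon"]] = entry
    return seen


# Sample lines copied from the log excerpt in this thread.
SAMPLE = [
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: send_ais_text:       "
    "Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)",
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: cib_ais_destroy:     "
    "Corosync connection lost!  Exiting.",
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:     info: terminate_cib:       "
    "cib_ais_destroy: Exiting fast...",
]

if __name__ == "__main__":
    for daemon, entry in first_errors(SAMPLE).items():
        print(daemon, entry["stamp"], entry["func"], entry["msg"])
```

The first error per daemon is usually the interesting one: here the cib's
first complaint is the failed CPG send, which points at corosync rather than
at pacemaker itself. To run it against a real file, pass an open file object
to first_errors() in place of SAMPLE.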


> 
> 
> [root@ip-10-50-3-122 ~]# pcs config
> Corosync Nodes:
> 
> Pacemaker Nodes:
> ip-10-50-3-122 ip-10-50-3-251 
> 
> Resources: 
> Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>  Attributes: first_network_interface_id=eni-e4e0b68c 
> second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 
> second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f 
> interval=5s 
>  Operations: monitor interval=5s
> Clone: EIP-AND-VARNISH-clone
>  Group: EIP-AND-VARNISH
>   Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>    Operations: monitor interval=5s
>   Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>    Operations: monitor interval=5s
>   Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>    Operations: monitor interval=5s
> Resource: ec2-fencing (type=fence_ec2 class=stonith)
>  Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list 
> pcmk_host_list=HA01 HA02 
>  Operations: monitor start-delay=30s interval=0 timeout=150s
> 
> Location Constraints:
> Ordering Constraints:
>  ClusterEIP_54.215.143.166 then Varnish
>  Varnish then Varnishlog
>  Varnishlog then Varnishncsa
> Colocation Constraints:
>  Varnish with ClusterEIP_54.215.143.166
>  Varnishlog with Varnish
>  Varnishncsa with Varnishlog
> 
> Cluster Properties:
> dc-version: 1.1.8-7.el6-394e906
> cluster-infrastructure: cman
> last-lrm-refresh: 1384196963
> no-quorum-policy: ignore
> stonith-enabled: true
> 
> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


