Re: [Pacemaker] Network outage debugging

Sean Lutner Tue, 12 Nov 2013 16:53:59 -0800


> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> 
> 
>> On 13 Nov 2013, at 11:22 am, Sean Lutner <s...@rentul.net> wrote:
>> 
>> 
>> 
>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>> 
>>> 
>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <s...@rentul.net> wrote:
>>>> 
>>>> The folks testing the cluster I've been building have run a script which 
>>>> blocks all traffic except SSH on one node of the cluster for 15 seconds to 
>>>> mimic a network failure. During this time, the network being "down" seems 
>>>> to cause some odd behavior from pacemaker resulting in it dying.
>>>> 
>>>> The cluster is two nodes and running four custom resources on EC2 
>>>> instances. The OS is CentOS 6.4 with the config below:
>>>> 
>>>> I've attached the /var/log/messages and /var/log/cluster/corosync.log from 
>>>> the time period during the test. I'm having some difficulty in piecing 
>>>> together what happened and am hoping someone can shed some light on the 
>>>> problem. Any indications why pacemaker is dying on that node?
>>> 
>>> Because corosync is dying underneath it:
>>> 
>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: send_ais_text:    Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: pcmk_cpg_dispatch:    Connection to the CPG API failed: 2
>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: cib_ais_destroy:    Corosync connection lost!  Exiting.
>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:     info: terminate_cib:    cib_ais_destroy: Exiting fast...
>> 
>> Is that the expected behavior?
> 
> It is expected behaviour when corosync dies.  Ideally corosync wouldn't die 
> though.


What other debugging can I do to try to find out why corosync died? 
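
In the meantime, one thing I can try on my side is filtering the log by severity so the first failures around the outage are easier to spot. Below is a minimal sketch, assuming every entry in /var/log/cluster/corosync.log sits on a single line in the same layout as the lines you quoted (timestamp, [pid], host, daemon, level, function, message). The names LogEntry, parse_line and error_entries are my own, made up for illustration; they are not part of pacemaker or corosync. The "crit" level in the default filter is also an assumption about what the log may contain.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    pid: int
    host: str
    daemon: str
    level: str
    function: str
    message: str


# Layout assumed from the quoted lines:
#   Nov 09 14:51:49 [942] ip-10-50-3-251  cib:  error: send_ais_text:  <message>
LINE_RE = re.compile(
    r"^(\w{3} \d{2} \d{2}:\d{2}:\d{2}) \[(\d+)\] (\S+)\s+"
    r"(\S+):\s+(\w+):\s+(\S+):\s+(.*)$"
)


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the assumed layout."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        return None
    ts, pid, host, daemon, level, function, message = match.groups()
    return LogEntry(ts, int(pid), host, daemon, level, function, message)


def error_entries(
    lines: Iterable[str], levels: Iterable[str] = ("error", "crit")
) -> List[LogEntry]:
    """Return the parsed entries whose level is in `levels`, in file order."""
    wanted = set(levels)
    entries = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.level in wanted:
            entries.append(entry)
    return entries


# The four lines quoted earlier in this thread, one entry per line.
SAMPLE = (
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: send_ais_text:    "
    "Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)\n"
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: pcmk_cpg_dispatch:    "
    "Connection to the CPG API failed: 2\n"
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: cib_ais_destroy:    "
    "Corosync connection lost!  Exiting.\n"
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:     info: terminate_cib:    "
    "cib_ais_destroy: Exiting fast...\n"
)

if __name__ == "__main__":
    for entry in error_entries(SAMPLE.splitlines()):
        print(entry.timestamp, entry.daemon, entry.function, entry.message)
```

To run it against the real file I would pass an open file handle for /var/log/cluster/corosync.log to error_entries() instead of the sample. This only narrows down where to look; it does not by itself say why corosync died.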

Thanks

> 
>> Is it because the DC was the other node?
> 
> No.
> 
>> 
>> I did notice that there was an attempted fence operation but it didn't look 
>> successful. 
>> 
>>> 
>>> 
>>>> 
>>>> 
>>>> [root@ip-10-50-3-122 ~]# pcs config
>>>> Corosync Nodes:
>>>> 
>>>> Pacemaker Nodes:
>>>> ip-10-50-3-122 ip-10-50-3-251 
>>>> 
>>>> Resources: 
>>>> Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>> Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s 
>>>> Operations: monitor interval=5s
>>>> Clone: EIP-AND-VARNISH-clone
>>>> Group: EIP-AND-VARNISH
>>>> Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>> Operations: monitor interval=5s
>>>> Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>> Operations: monitor interval=5s
>>>> Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>> Operations: monitor interval=5s
>>>> Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>> Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02 
>>>> Operations: monitor start-delay=30s interval=0 timeout=150s
>>>> 
>>>> Location Constraints:
>>>> Ordering Constraints:
>>>> ClusterEIP_54.215.143.166 then Varnish
>>>> Varnish then Varnishlog
>>>> Varnishlog then Varnishncsa
>>>> Colocation Constraints:
>>>> Varnish with ClusterEIP_54.215.143.166
>>>> Varnishlog with Varnish
>>>> Varnishncsa with Varnishlog
>>>> 
>>>> Cluster Properties:
>>>> dc-version: 1.1.8-7.el6-394e906
>>>> cluster-infrastructure: cman
>>>> last-lrm-refresh: 1384196963
>>>> no-quorum-policy: ignore
>>>> stonith-enabled: true
>>>> 
>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>> 
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org
