Re: [Pacemaker] Network outage debugging

Andrew Beekhof Tue, 12 Nov 2013 18:52:33 -0800

On 13 Nov 2013, at 11:49 am, Sean Lutner <s...@rentul.net> wrote:

> 
> 
>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>> 
>> 
>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <s...@rentul.net> wrote:
>>> 
>>> 
>>> 
>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>> 
>>>> 
>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <s...@rentul.net> wrote:
>>>>> 
>>>>> The folks testing the cluster I've been building have run a script which 
>>>>> blocks all traffic except SSH on one node of the cluster for 15 seconds 
>>>>> to mimic a network failure. During this time, the network being "down" 
>>>>> seems to cause some odd behavior from pacemaker resulting in it dying.
>>>>> 
>>>>> The cluster is two nodes and running four custom resources on EC2 
>>>>> instances. The OS is CentOS 6.4 with the config below:
>>>>> 
>>>>> I've attached the /var/log/messages and /var/log/cluster/corosync.log 
>>>>> from the time period during the test. I'm having some difficulty in 
>>>>> piecing together what happened and am hoping someone can shed some light 
>>>>> on the problem. Any indications why pacemaker is dying on that node?
>>>> 
>>>> Because corosync is dying underneath it:
>>>> 
>>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: send_ais_text:    Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: pcmk_cpg_dispatch:    Connection to the CPG API failed: 2
>>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: cib_ais_destroy:    Corosync connection lost!  Exiting.
>>>> Nov 09 14:51:49 [942] ip-10-50-3-251        cib:     info: terminate_cib:    cib_ais_destroy: Exiting fast...
>>> 
>>> Is that the expected behavior?
>> 
>> It is expected behaviour when corosync dies.  Ideally corosync wouldn't die 
>> though.
> 
> What other debugging can I do to try to find out why corosync died?


There are various logging settings that may help.
CC'ing Jan to see if he has any suggestions.
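
In the meantime, it may help to pull the error-level lines out of corosync.log and group them by daemon, so the order in which things failed is easier to see. A rough sketch in Python, assuming every line follows the same layout as the excerpt quoted above; the regex and the helper names (parse_line, errors_by_daemon) are made up for illustration and are not part of Pacemaker or corosync:

```python
import re

# Layout assumed from the excerpt in this thread:
#   <Mon DD HH:MM:SS> [pid] <host>  <daemon>:  <level>: <function>:  <message>
LOG_LINE = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<pid>\d+)\] (?P<host>\S+)\s+"
    r"(?P<daemon>[\w-]+):\s+(?P<level>\w+):\s+"
    r"(?P<func>\w+):\s+(?P<msg>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


def errors_by_daemon(lines):
    """Group error-level messages by the daemon that logged them, in file order."""
    grouped = {}
    for line in lines:
        entry = parse_line(line)
        if entry and entry["level"] == "error":
            grouped.setdefault(entry["daemon"], []).append(
                (entry["stamp"], entry["func"], entry["msg"])
            )
    return grouped


# The four lines from the excerpt, rejoined where the mail client wrapped them.
sample = [
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: send_ais_text:    "
    "Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)",
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: pcmk_cpg_dispatch:    "
    "Connection to the CPG API failed: 2",
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:    error: cib_ais_destroy:    "
    "Corosync connection lost!  Exiting.",
    "Nov 09 14:51:49 [942] ip-10-50-3-251        cib:     info: terminate_cib:    "
    "cib_ais_destroy: Exiting fast...",
]

if __name__ == "__main__":
    for daemon, entries in errors_by_daemon(sample).items():
        for stamp, func, msg in entries:
            print(daemon, stamp, func, msg)
```

To run it against a real file, pass the open file object instead of the sample list, e.g. errors_by_daemon(open("/var/log/cluster/corosync.log")). Lines that do not fit the assumed layout are skipped rather than guessed at.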

> 
> Thanks
> 
>> 
>>> Is it because the DC was the other node?
>> 
>> No.
>> 
>>> 
>>> I did notice that there was an attempted fence operation but it didn't look 
>>> successful. 
>>> 
>>>> 
>>>> 
>>>>> 
>>>>> 
>>>>> [root@ip-10-50-3-122 ~]# pcs config
>>>>> Corosync Nodes:
>>>>> 
>>>>> Pacemaker Nodes:
>>>>> ip-10-50-3-122 ip-10-50-3-251 
>>>>> 
>>>>> Resources: 
>>>>> Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>> Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s 
>>>>> Operations: monitor interval=5s
>>>>> Clone: EIP-AND-VARNISH-clone
>>>>> Group: EIP-AND-VARNISH
>>>>> Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>> Operations: monitor interval=5s
>>>>> Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>> Operations: monitor interval=5s
>>>>> Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>> Operations: monitor interval=5s
>>>>> Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>> Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02 
>>>>> Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>> 
>>>>> Location Constraints:
>>>>> Ordering Constraints:
>>>>> ClusterEIP_54.215.143.166 then Varnish
>>>>> Varnish then Varnishlog
>>>>> Varnishlog then Varnishncsa
>>>>> Colocation Constraints:
>>>>> Varnish with ClusterEIP_54.215.143.166
>>>>> Varnishlog with Varnish
>>>>> Varnishncsa with Varnishlog
>>>>> 
>>>>> Cluster Properties:
>>>>> dc-version: 1.1.8-7.el6-394e906
>>>>> cluster-infrastructure: cman
>>>>> last-lrm-refresh: 1384196963
>>>>> no-quorum-policy: ignore
>>>>> stonith-enabled: true
>>>>> 
>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
>>>>> _______________________________________________
>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>> 
>>>>> Project Home: http://www.clusterlabs.org
>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>> Bugs: http://bugs.clusterlabs.org


