On 13 Nov 2013, at 11:49 am, Sean Lutner <s...@rentul.net> wrote:
>
>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>
>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <s...@rentul.net> wrote:
>>>
>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>
>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <s...@rentul.net> wrote:
>>>>>
>>>>> The folks testing the cluster I've been building ran a script that blocks all traffic except SSH on one node of the cluster for 15 seconds to mimic a network failure. While the network is "down", Pacemaker exhibits some odd behaviour and ends up dying.
>>>>>
>>>>> The cluster is two nodes running four custom resources on EC2 instances. The OS is CentOS 6.4, with the config below.
>>>>>
>>>>> I've attached the /var/log/messages and /var/log/cluster/corosync.log from the time period of the test. I'm having some difficulty piecing together what happened and am hoping someone can shed some light on the problem. Any indication why pacemaker is dying on that node?
>>>>
>>>> Because corosync is dying underneath it:
>>>>
>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: Corosync connection lost! Exiting.
>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: cib_ais_destroy: Exiting fast...
>>>
>>> Is that the expected behavior?
>>
>> It is expected behaviour when corosync dies. Ideally corosync wouldn't die, though.
>
> What other debugging can I do to try to find out why corosync died?
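[Editor's note: the testers' script is not shown in the thread; the 15-second blackout described above could plausibly be reproduced with an iptables script along these lines. This is a hypothetical sketch, not the script actually used; it only prints the commands unless EXECUTE=1 is set, since inserting DROP rules is disruptive.]

```shell
#!/bin/sh
# Hypothetical reconstruction of the blackout test: drop all traffic
# except SSH (tcp/22) and loopback for 15 seconds, then restore.
# Dry-run by default; set EXECUTE=1 to actually apply the rules.
run() { if [ "$EXECUTE" = 1 ]; then "$@"; else echo "$@"; fi; }

run iptables -I INPUT  -p tcp --dport 22 -j ACCEPT  # keep SSH reachable
run iptables -I OUTPUT -p tcp --sport 22 -j ACCEPT
run iptables -I INPUT  -i lo -j ACCEPT              # keep loopback so local IPC survives
run iptables -A INPUT  -j DROP                      # drop everything else inbound
run iptables -A OUTPUT -j DROP                      # ...and outbound
run sleep 15                                        # the simulated 15-second outage
run iptables -F INPUT                               # restore by flushing the added rules
run iptables -F OUTPUT
```

Note that cutting corosync's cluster traffic for 15 seconds can exceed its token timeout, which is consistent with the "Connection timed out (110)" CPG errors quoted above.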
There are various logging settings that may help. CC'ing Jan to see if he has any suggestions.
>
> Thanks
>
>>> Is it because the DC was the other node?
>>
>> No.
>>
>>> I did notice that there was an attempted fence operation but it didn't look successful.
>>>
>>>>> [root@ip-10-50-3-122 ~]# pcs config
>>>>> Corosync Nodes:
>>>>>
>>>>> Pacemaker Nodes:
>>>>>  ip-10-50-3-122 ip-10-50-3-251
>>>>>
>>>>> Resources:
>>>>>  Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>>   Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
>>>>>   Operations: monitor interval=5s
>>>>>  Clone: EIP-AND-VARNISH-clone
>>>>>   Group: EIP-AND-VARNISH
>>>>>    Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>>     Operations: monitor interval=5s
>>>>>    Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>>     Operations: monitor interval=5s
>>>>>    Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>>     Operations: monitor interval=5s
>>>>>  Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>>   Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
>>>>>   Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>>
>>>>> Location Constraints:
>>>>> Ordering Constraints:
>>>>>  ClusterEIP_54.215.143.166 then Varnish
>>>>>  Varnish then Varnishlog
>>>>>  Varnishlog then Varnishncsa
>>>>> Colocation Constraints:
>>>>>  Varnish with ClusterEIP_54.215.143.166
>>>>>  Varnishlog with Varnish
>>>>>  Varnishncsa with Varnishlog
>>>>>
>>>>> Cluster Properties:
>>>>>  dc-version: 1.1.8-7.el6-394e906
>>>>>  cluster-infrastructure: cman
>>>>>  last-lrm-refresh: 1384196963
>>>>>  no-quorum-policy: ignore
>>>>>  stonith-enabled: true
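[Editor's note: the "logging settings" suggested above are not spelled out in the thread. On this stack (cluster-infrastructure: cman, per the config below), corosync is configured through /etc/cluster/cluster.conf rather than corosync.conf, and one commonly raised knob is the <logging> element. The fragment below is a sketch only; the cluster name and config_version are placeholders, and the attribute names should be checked against the cluster.conf(5) man page for the installed version.]

```
<!-- Illustrative cluster.conf fragment: turn on debug logging and
     write it to the file already being collected in this thread. -->
<cluster name="ha-cluster" config_version="2">
  ...
  <logging debug="on"
           to_syslog="yes"
           to_logfile="yes"
           logfile="/var/log/cluster/corosync.log"/>
</cluster>
```

After bumping config_version and propagating the file to both nodes, re-running the blackout test should produce considerably more detail around the point where corosync exits.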
>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org