On Nov 13, 2013, at 10:24 AM, Jan Friesse <jfrie...@redhat.com> wrote:
> Sean Lutner napsal(a):
>>
>> On Nov 13, 2013, at 3:15 AM, Jan Friesse <jfrie...@redhat.com> wrote:
>>
>>> Andrew Beekhof napsal(a):
>>>>
>>>> On 13 Nov 2013, at 11:49 am, Sean Lutner <s...@rentul.net> wrote:
>>>>
>>>>>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>
>>>>>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>
>>>>>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>>>>>
>>>>>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <s...@rentul.net> wrote:
>>>>>>>>>
>>>>>>>>> The folks testing the cluster I've been building have run a script which blocks all traffic except SSH on one node of the cluster for 15 seconds to mimic a network failure. During this time, the network being "down" seems to cause some odd behavior from Pacemaker, resulting in it dying.
>>>>>>>>>
>>>>>>>>> The cluster is two nodes running four custom resources on EC2 instances. The OS is CentOS 6.4 with the config below.
>>>>>>>>>
>>>>>>>>> I've attached the /var/log/messages and /var/log/cluster/corosync.log from the time period during the test. I'm having some difficulty piecing together what happened and am hoping someone can shed some light on the problem. Any indication why Pacemaker is dying on that node?
>>>>>>>>
>>>>>>>> Because corosync is dying underneath it:
>>>>>>>>
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: Corosync connection lost! Exiting.
>>>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: cib_ais_destroy: Exiting fast...
>>>>>>>
>>>>>>> Is that the expected behavior?
>>>>>>
>>>>>> It is expected behaviour when corosync dies. Ideally corosync wouldn't die, though.
>>>>>
>>>>> What other debugging can I do to try to find out why corosync died?
>>>>
>>>> There are various logging settings that may help. CC'ing Jan to see if he has any suggestions.
>>>
>>> If corosync really died, the corosync-fplay output (captured right after corosync's death) and a coredump are the most useful.
>>>
>>> Regards,
>>>   Honza
>>
>> So the process to collect this would be:
>>
>> - Run the test
>> - Watch the logs for corosync to die
>> - Run corosync-fplay and capture the output (will corosync-fplay > file.out suffice?)
>
> Yes. Usually the file is quite large, so gzip/xz is a good idea.

Thanks, will do.

>> - Capture a core dump from corosync
>>
>> How do I capture the core dump? Is it something that has to be enabled in the /etc/corosync/corosync.conf file before running the tests? I've not done this in the past.
>
> This really depends. Do you have abrt enabled? If so, the core is processed via abrt. (The way to find out whether abrt is running is to look at the kernel.core_pattern sysctl; there will be something other than the classic value "core".)

# sysctl -A | grep core_pattern
kernel.core_pattern = /var/tmp/%e-%t-%s.core

I looked in that directory and there are some core files, but nothing from the day this failure happened. I'm skeptical that one will be created if I run the test again. Is it accurate to say that whenever corosync dies in the manner seen in the logs, there should be a core file?

> If you do not have abrt enabled, you must make sure to enable core dumps. When executing corosync via cman, it should be enabled automatically (the start_global function does ulimit -c unlimited).
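Putting the collection steps so far into one place, something like the sketch below could be run on the node right after corosync dies. This is only an illustration: the output path is made up, and it assumes the corosync package (which provides corosync-fplay) is installed.

```shell
#!/bin/sh
# Sketch: dump the corosync flight recorder right after corosync dies,
# then compress the output (Honza notes the file is usually large).
# The output path below is illustrative, not prescribed in this thread.
OUT="${1:-/var/tmp/corosync-fplay-$(date +%Y%m%d%H%M%S).out}"

if command -v corosync-fplay >/dev/null 2>&1; then
    corosync-fplay > "$OUT" && gzip -f "$OUT" && echo "captured ${OUT}.gz"
else
    echo "corosync-fplay not found; install the corosync package first"
fi

# Check who handles core dumps: anything other than the classic value
# "core" usually means abrt (or another handler) is intercepting them.
sysctl -n kernel.core_pattern 2>/dev/null || echo "kernel.core_pattern unavailable"
```

Running it immediately after the failure matters, since the flight recorder is a ring buffer and later activity can overwrite the interesting events.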
> If you are using corosync itself, create the file /etc/default/corosync with the content "ulimit -c unlimited".
>
> Coredumps are stored in /var/lib/corosync/core.* (maybe you already have some of them there, so just take a look).
>
> Now, please install the corosynclib-devel package and use http://stackoverflow.com/questions/5115613/core-dump-file-analysis

Thanks, I'll install that package.

> The important part is to execute bt (or, even better, thread apply all bt) and send the output of this command.
>
> Regards,
>   Honza

Thanks

>> Thanks
>>
>>>>> Thanks
>>>>>
>>>>>>> Is it because the DC was the other node?
>>>>>>
>>>>>> No.
>>>>>>
>>>>>>> I did notice that there was an attempted fence operation but it didn't look successful.
>>>>>>>>
>>>>>>>>> [root@ip-10-50-3-122 ~]# pcs config
>>>>>>>>> Corosync Nodes:
>>>>>>>>>
>>>>>>>>> Pacemaker Nodes:
>>>>>>>>>  ip-10-50-3-122 ip-10-50-3-251
>>>>>>>>>
>>>>>>>>> Resources:
>>>>>>>>>  Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>>>>>>   Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
>>>>>>>>>   Operations: monitor interval=5s
>>>>>>>>>  Clone: EIP-AND-VARNISH-clone
>>>>>>>>>   Group: EIP-AND-VARNISH
>>>>>>>>>    Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>>    Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>>    Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>>>>>>     Operations: monitor interval=5s
>>>>>>>>>  Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>>>>>>   Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
>>>>>>>>>   Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>>>>>>
>>>>>>>>> Location Constraints:
>>>>>>>>> Ordering Constraints:
>>>>>>>>>  ClusterEIP_54.215.143.166 then Varnish
>>>>>>>>>  Varnish then Varnishlog
>>>>>>>>>  Varnishlog then Varnishncsa
>>>>>>>>> Colocation Constraints:
>>>>>>>>>  Varnish with ClusterEIP_54.215.143.166
>>>>>>>>>  Varnishlog with Varnish
>>>>>>>>>  Varnishncsa with Varnishlog
>>>>>>>>>
>>>>>>>>> Cluster Properties:
>>>>>>>>>  dc-version: 1.1.8-7.el6-394e906
>>>>>>>>>  cluster-infrastructure: cman
>>>>>>>>>  last-lrm-refresh: 1384196963
>>>>>>>>>  no-quorum-policy: ignore
>>>>>>>>>  stonith-enabled: true
>>>>>>>>>
>>>>>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
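Once a core does show up under /var/lib/corosync/ (or wherever kernel.core_pattern points), the backtrace capture Honza asks for could look like the sketch below. It is only an illustration: the report file name is made up, and it assumes gdb plus the corosynclib-devel/debuginfo packages are installed so the backtrace has symbols.

```shell
#!/bin/sh
# Sketch: pick the newest corosync core and extract backtraces with gdb.
# Paths and the report file name are illustrative.
CORE=$(ls -t /var/lib/corosync/core.* 2>/dev/null | head -n 1)

if [ -n "$CORE" ] && command -v gdb >/dev/null 2>&1; then
    # -batch runs the -ex commands and exits; "thread apply all bt"
    # is the per-thread backtrace Honza asks to have sent to the list.
    gdb -batch -ex 'bt' -ex 'thread apply all bt' \
        /usr/sbin/corosync "$CORE" > corosync-backtrace.txt 2>&1
    echo "backtraces written to corosync-backtrace.txt"
else
    echo "no corosync core (or no gdb) found; nothing to analyze yet"
fi
```

Without the matching debuginfo packages the backtrace will be mostly "??" frames, which is much less useful on the list.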
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org