Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

Andrew Martin Fri, 02 Nov 2012 11:13:10 -0700

Hi Angus,


Corosync died again while using libqb 0.14.3. Here is the coredump from today:
http://sources.xes-inc.com/downloads/corosync.nov2.coredump



# corosync -f
notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service.
info [MAIN ] Corosync built-in features: pie relro bindnow
Bus error (core dumped)


Here's the log: http://pastebin.com/bUfiB3T3


Did your analysis of the core dump reveal anything?


Is there a way for me to make it generate fdata with a bus error, or how else 
can I gather additional information to help debug this?


Thanks,


Andrew

----- Original Message -----

From: "Angus Salkeld" <asalk...@redhat.com>
To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
Sent: Thursday, November 1, 2012 5:47:16 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

On 01/11/12 17:27 -0500, Andrew Martin wrote:
>Hi Angus,
>
>
>I'll try upgrading to the latest libqb tomorrow and see if I can reproduce 
>this behavior with it. I was able to get a coredump by running corosync 
>manually in the foreground (corosync -f):
>http://sources.xes-inc.com/downloads/corosync.coredump

Thanks, looking...

>
>
>There still isn't anything added to /var/lib/corosync however. What do I need 
>to do to enable the fdata file to be created?

Well if it crashes with SIGSEGV it will generate it automatically.
(I see you are getting a bus error) - :(.

-A

>
>
>Thanks,
>
>Andrew
>
>----- Original Message -----
>
>From: "Angus Salkeld" <asalk...@redhat.com>
>To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>Sent: Thursday, November 1, 2012 5:11:23 PM
>Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>cluster
>
>On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>Hi Honza,
>>
>>
>>Thanks for the help. I enabled core dumps in /etc/security/limits.conf but 
>>didn't have a chance to reboot and apply the changes, so I don't have a core 
>>dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID 
>>file to be generated? Right now all that is in /var/lib/corosync are the 
>>ringid_XXX files. Do I need to set something explicitly in the corosync 
>>config to enable this logging?
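
[Editorial sketch, not part of the original message] One way to check whether core dumps are actually enabled for the shell that launches corosync is to read the RLIMIT_CORE limit of the current process. As far as I know, /etc/security/limits.conf is applied by pam_limits when a session is opened, so a fresh login is usually enough and a reboot is not strictly required; treat that as an assumption to verify on your system. The function name core_limit_status is hypothetical.

```python
import resource


def core_limit_status():
    """Return (enabled, description) for this process's core-file size limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

    def fmt(value):
        # RLIM_INFINITY is the sentinel the kernel uses for "no limit".
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    # A soft limit of 0 means the kernel will not write a core file.
    enabled = soft != 0
    return enabled, "soft=%s hard=%s" % (fmt(soft), fmt(hard))


if __name__ == "__main__":
    enabled, description = core_limit_status()
    print("core dumps enabled:", enabled, "(%s)" % description)
```

Run it from the same session that will start corosync -f; child processes inherit the limit, so the result should reflect what corosync would see when started by hand.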
>>
>>
>>I did find something else interesting with libqb this time. I compiled 
>>libqb 0.14.2 for use with the cluster. This time when corosync died I noticed 
>>the following in dmesg:
>>Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide 
>>error ip:7f657a52e517 sp:7fffd5068858 error:0 in 
>>libqb.so.0.14.2[7f657a525000+1f000]
>>This error appeared only once among the many times corosync has died.
>>
>>
>>I see that there is a newer version of libqb (0.14.3) out, but didn't see a 
>>fix for this particular bug. Could this libqb problem be what is causing 
>>corosync to hang up? Here's the corresponding corosync log file (next time I 
>>should have a core dump as well):
>>http://pastebin.com/5FLKg7We
>
>Hi Andrew
>
>I can't see much wrong with the log either. If you could run with the latest
>(libqb-0.14.3) and post a backtrace if it still happens, that would be great.
>
>Thanks
>Angus
>
>>
>>
>>Thanks,
>>
>>
>>Andrew
>>
>>----- Original Message -----
>>
>>From: "Jan Friesse" <jfrie...@redhat.com>
>>To: "Andrew Martin" <amar...@xes-inc.com>
>>Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" 
>><pacemaker@oss.clusterlabs.org>
>>Sent: Thursday, November 1, 2012 7:55:52 AM
>>Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>
>>Andrew,
>>I was not able to find anything interesting (from corosync point of
>>view) in configuration/logs (corosync related).
>>
>>What would be helpful:
>>- If corosync died, there should be a
>>/var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you
>>please xz them and store them somewhere (they are quite large but compress
>>well).
>>- If you are able to reproduce the problem (which it seems you are), can
>>you please enable generation of coredumps and store a backtrace of the
>>coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID;
>>the way to obtain a backtrace is gdb corosync /var/lib/corosync/core.pid, and
>>then thread apply all bt.) If you are running a distribution with ABRT
>>support, you can also use ABRT to generate a report.
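
[Editorial sketch, not part of the original message] The two requests above (compress the fdata files with xz, and capture a backtrace of all threads from the core file) could be scripted roughly as follows. This is a minimal sketch: the helper names backtrace_command, save_backtrace and xz_compress and the default output file name are hypothetical, and it assumes gdb is installed and corosync is on the PATH.

```python
import lzma
import shutil
import subprocess
from pathlib import Path
from typing import List


def backtrace_command(core: str, binary: str = "corosync") -> List[str]:
    """Build a non-interactive gdb call that prints a backtrace of every thread."""
    return ["gdb", "--batch", "-ex", "thread apply all bt", binary, core]


def save_backtrace(core: str, out: str = "corosync-backtrace.txt") -> Path:
    """Run gdb against the core file and save everything it prints."""
    result = subprocess.run(backtrace_command(core), capture_output=True, text=True)
    target = Path(out)
    target.write_text(result.stdout + result.stderr)
    return target


def xz_compress(path: str) -> Path:
    """Write an .xz copy next to the original file and return the new path."""
    source = Path(path)
    target = source.with_name(source.name + ".xz")
    with source.open("rb") as src, lzma.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, save_backtrace("/var/lib/corosync/core.PID") with the real PID substituted would produce a text file suitable for posting, and xz_compress can be applied to each fdata file in /var/lib/corosync; the original file is left in place.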
>>
>>Regards,
>>Honza
>>
>>Andrew Martin napsal(a):
>>> Corosync died an additional 3 times during the night on storage1. I wrote a 
>>> daemon to attempt to restart it as soon as it fails, so only one of those 
>>> times resulted in a STONITH of storage1.
>>>
>>> I enabled debug in the corosync config, so I was able to capture a period 
>>> when corosync died with debug output:
>>> http://pastebin.com/eAmJSmsQ
>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For 
>>> reference, here is my Pacemaker configuration:
>>> http://pastebin.com/DFL3hNvz
>>>
>>> It seems that an extra node, 16777343 "localhost", has been added to the 
>>> cluster after storage1 was STONITHed (must be the localhost interface on 
>>> storage1). Is there any way to prevent this?
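
[Editorial sketch, not part of the original message] The guess that node 16777343 is the loopback interface can be checked with a little arithmetic: 16777343 is 0x0100007F, and reading those four bytes in little-endian order gives 127.0.0.1. The snippet below shows the conversion. It assumes the node ID was derived from a 32-bit IPv4 address stored in that byte order, which is an assumption about how the ID was generated rather than something stated in this thread; the function name nodeid_to_ipv4 is hypothetical.

```python
import socket


def nodeid_to_ipv4(nodeid: int, byteorder: str = "little") -> str:
    """Interpret a 32-bit node ID as a dotted-quad IPv4 address."""
    return socket.inet_ntoa(nodeid.to_bytes(4, byteorder))


if __name__ == "__main__":
    print(nodeid_to_ipv4(16777343))  # prints 127.0.0.1
```

If that assumption holds, the extra node would correspond to corosync having bound to 127.0.0.1 at some point on storage1.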
>>>
>>> Does this help to determine why corosync is dying, and what I can do to fix 
>>> it?
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ----- Original Message -----
>>>
>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>> To: disc...@corosync.org
>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>
>>>
>>> Hello,
>>>
>>> I recently configured a 3-node fileserver cluster by building Corosync 
>>> 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 
>>> 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" nodes 
>>> where the resources run (a DRBD disk, filesystem mount, and samba/nfs 
>>> daemons), while the third node (storagequorum) is in standby mode and acts 
>>> as a quorum node for the cluster. Today I discovered that corosync died on 
>>> both storage0 and storage1 at the same time. Since corosync died, pacemaker 
>>> shut down as well on both nodes. Because the cluster no longer had quorum 
>>> (and the no-quorum-policy="freeze"), storagequorum was unable to STONITH 
>>> either node and just left the resources frozen where they were running, on 
>>> storage0. I cannot find any log information to determine why corosync 
>>> crashed, and this is a disturbing problem as the cluster and its messaging 
>>> layer must be stable. Below is my corosync configuration file as well as 
>>> the corosync log file from each node during this period.
>>>
>>> corosync.conf:
>>> http://pastebin.com/vWQDVmg8
>>> Note that I have two redundant rings. On one of them, I specify the IP 
>>> address (in this example 10.10.10.7) so that it binds to the correct 
>>> interface (since potentially in the future those machines may have two 
>>> interfaces on the same subnet).
>>>
>>> corosync.log from storage0:
>>> http://pastebin.com/HK8KYDDQ
>>>
>>> corosync.log from storage1:
>>> http://pastebin.com/sDWkcPUz
>>>
>>> corosync.log from storagequorum (the DC during this period):
>>> http://pastebin.com/uENQ5fnf
>>>
>>> Issuing service corosync start && service pacemaker start on storage0 and 
>>> storage1 resolved the problem and allowed the nodes to successfully 
>>> reconnect to the cluster. What other information can I provide to help 
>>> diagnose this problem and prevent it from recurring?
>>>
>>> Thanks,
>>>
>>> Andrew Martin
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> disc...@corosync.org
>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>
>>
>>
>
>>_______________________________________________
>>Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>Project Home: http://www.clusterlabs.org
>>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>Bugs: http://bugs.clusterlabs.org
>
>
