A bit more data on this problem: I was doing some maintenance and had to 
briefly disconnect storagequorum's connection to the STONITH network (ethernet 
cable #7 in this diagram):
http://sources.xes-inc.com/downloads/storagecluster.png


Since corosync has two rings (and rrp_mode is active), this should cause no 
disruption to the cluster. However, as soon as I disconnected cable #7, 
corosync on storage0 died (corosync was already stopped on storage1), which 
caused pacemaker on storage0 to shut down as well. I was not able to obtain a 
coredump this time because apport is still running on storage0.
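
For reference, the two-ring setup in question looks roughly like the sketch 
below (an illustration only, with placeholder addresses; the actual 
corosync.conf is linked further down the thread):

totem {
    version: 2
    rrp_mode: active              # active: both rings carry every message
    interface {
        ringnumber: 0
        bindnetaddr: 10.10.10.7   # placeholder; one ring is bound to a specific IP
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.1.0  # placeholder network for the second (STONITH) ring
        mcastaddr: 226.94.1.2
        mcastport: 5407
    }
}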


What else can I do to debug this problem? Or, should I just try to downgrade to 
corosync 1.4.2 (the version available in the Ubuntu repositories)?


Thanks,


Andrew

----- Original Message -----

From: "Andrew Martin" <amar...@xes-inc.com>
To: "Angus Salkeld" <asalk...@redhat.com>
Cc: disc...@corosync.org, pacemaker@oss.clusterlabs.org
Sent: Tuesday, November 6, 2012 2:01:17 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster


Hi Angus,


I recompiled corosync with the changes you suggested in exec/main.c to generate 
fdata when SIGBUS is triggered. Here are the corresponding coredump and fdata 
files:
http://sources.xes-inc.com/downloads/core.13027
http://sources.xes-inc.com/downloads/fdata.20121106



(gdb) thread apply all bt


Thread 1 (Thread 0x7ffff7fec700 (LWP 13027)):
#0 0x00007ffff775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0
#1 0x00007ffff77656b9 in ?? () from /usr/lib/libqb.so.0
#2 0x00007ffff77637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0
#3 0x0000555555571700 in ?? ()
#4 0x00007ffff7bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5
#5 0x00007ffff7bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5
#6 0x00007ffff7bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5
#7 0x00007ffff775d46f in ?? () from /usr/lib/libqb.so.0
#8 0x00007ffff775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0
#9 0x0000555555560945 in main ()




I've also been running some hardware tests to rule out hardware as the cause of 
this problem: mcelog has found no errors and memtest reports the memory as 
healthy as well.


Thanks,


Andrew
----- Original Message -----

From: "Angus Salkeld" <asalk...@redhat.com>
To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
Sent: Friday, November 2, 2012 8:18:51 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

On 02/11/12 13:07 -0500, Andrew Martin wrote:
>Hi Angus,
>
>
>Corosync died again while using libqb 0.14.3. Here is the coredump from today:
>http://sources.xes-inc.com/downloads/corosync.nov2.coredump
>
>
>
># corosync -f
>notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide 
>service.
>info [MAIN ] Corosync built-in features: pie relro bindnow
>Bus error (core dumped)
>
>
>Here's the log: http://pastebin.com/bUfiB3T3
>
>
>Did your analysis of the core dump reveal anything?
>

I can't get any symbols out of these coredumps. Can you try to get a backtrace?
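
For example, something along these lines (assuming core dumps are enabled in 
the shell and the corosync binary still has its debug symbols; adjust the 
binary and core file paths to match your installation):

ulimit -c unlimited
corosync -f
gdb /usr/sbin/corosync /var/lib/corosync/core.<PID>
(gdb) thread apply all bt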

>
>Is there a way for me to make it generate fdata with a bus error, or how else 
>can I gather additional information to help debug this?
>

If you look in exec/main.c for SIGSEGV, you will see how the fdata mechanism
works. Just add a handler for SIGBUS and hook it up; then you should be able
to get the fdata for both signals.
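
In other words, a sketch of the kind of handler I mean (the helper names below
follow the existing SIGSEGV path in exec/main.c and may differ slightly in your
tree, so treat this as a rough guide rather than a drop-in patch):

/* mirror the SIGSEGV handling so a SIGBUS also flushes the blackbox (fdata) */
static void sigbus_handler (int num)
{
        (void)signal (SIGBUS, SIG_DFL);      /* restore default so the re-raise is fatal */
        corosync_blackbox_write_to_file ();  /* writes /var/lib/corosync/fdata-DATETIME-PID */
        qb_log_fini ();
        raise (SIGBUS);                      /* re-raise so a core dump is still produced */
}

/* ...and register it in main() next to the existing SIGSEGV registration: */
(void)signal (SIGBUS, sigbus_handler);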

I'd rather be able to get a backtrace if possible.

-Angus

>
>Thanks,
>
>
>Andrew
>
>----- Original Message -----
>
>From: "Angus Salkeld" <asalk...@redhat.com>
>To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>Sent: Thursday, November 1, 2012 5:47:16 PM
>Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>cluster
>
>On 01/11/12 17:27 -0500, Andrew Martin wrote:
>>Hi Angus,
>>
>>
>>I'll try upgrading to the latest libqb tomorrow and see if I can reproduce 
>>this behavior with it. I was able to get a coredump by running corosync 
>>manually in the foreground (corosync -f):
>>http://sources.xes-inc.com/downloads/corosync.coredump
>
>Thanks, looking...
>
>>
>>
>>There still isn't anything added to /var/lib/corosync however. What do I need 
>>to do to enable the fdata file to be created?
>
>Well, if it crashes with SIGSEGV it will generate the fdata automatically.
>(I see you are getting a bus error instead.) :(
>
>-A
>
>>
>>
>>Thanks,
>>
>>Andrew
>>
>>----- Original Message -----
>>
>>From: "Angus Salkeld" <asalk...@redhat.com>
>>To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>>Sent: Thursday, November 1, 2012 5:11:23 PM
>>Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
>>cluster
>>
>>On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>>Hi Honza,
>>>
>>>
>>>Thanks for the help. I enabled core dumps in /etc/security/limits.conf but 
>>>didn't have a chance to reboot and apply the changes so I don't have a core 
>>>dump this time. Do core dumps need to be enabled for the fdata-DATETIME-PID 
>>>file to be generated? Right now all that is in /var/lib/corosync are the 
>>>ringid_XXX files. Do I need to set something explicitly in the corosync 
>>>config to enable this logging?
>>>
>>>
>>>I did find something else interesting with libqb this time. I compiled 
>>>libqb 0.14.2 for use with the cluster. This time when corosync died I 
>>>noticed the following in dmesg:
>>>Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide 
>>>error ip:7f657a52e517 sp:7fffd5068858 error:0 in 
>>>libqb.so.0.14.2[7f657a525000+1f000]
>>>This error was only present for one of the many other times corosync has 
>>>died.
>>>
>>>
>>>I see that there is a newer version of libqb (0.14.3) out, but didn't see a 
>>>fix for this particular bug. Could this libqb problem be what is causing 
>>>corosync to crash? Here's the corresponding corosync log file (next time I 
>>>should have a core dump as well):
>>>http://pastebin.com/5FLKg7We
>>
>>Hi Andrew
>>
>>I can't see much wrong with the log either. If you could run with the latest
>>(libqb-0.14.3) and post a backtrace if it still happens, that would be great.
>>
>>Thanks
>>Angus
>>
>>>
>>>
>>>Thanks,
>>>
>>>
>>>Andrew
>>>
>>>----- Original Message -----
>>>
>>>From: "Jan Friesse" <jfrie...@redhat.com>
>>>To: "Andrew Martin" <amar...@xes-inc.com>
>>>Cc: disc...@corosync.org, "The Pacemaker cluster resource manager" 
>>><pacemaker@oss.clusterlabs.org>
>>>Sent: Thursday, November 1, 2012 7:55:52 AM
>>>Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>
>>>Andrew,
>>>I was not able to find anything interesting (from corosync's point of
>>>view) in the configuration/logs (corosync related).
>>>
>>>What would be helpful:
>>>- If corosync died, there should be a
>>>/var/lib/corosync/fdata-DATETIME-PID file from the dead corosync. Can you
>>>please xz these and store them somewhere (they are quite large but compress
>>>well).
>>>- If you are able to reproduce the problem (which it seems you are), can
>>>you please enable core dump generation and store a backtrace of the
>>>coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID;
>>>to obtain the backtrace, run gdb corosync /var/lib/corosync/core.PID and
>>>then thread apply all bt.) If you are running a distribution with ABRT
>>>support, you can also use ABRT to generate a report.
>>>
>>>Regards,
>>>Honza
>>>
>>>Andrew Martin napsal(a):
>>>> Corosync died an additional 3 times during the night on storage1. I wrote 
>>>> a daemon that attempts to restart it as soon as it fails, so only one of those 
>>>> times resulted in a STONITH of storage1.
>>>>
>>>> I enabled debug in the corosync config, so I was able to capture a period 
>>>> when corosync died with debug output:
>>>> http://pastebin.com/eAmJSmsQ
>>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For 
>>>> reference, here is my Pacemaker configuration:
>>>> http://pastebin.com/DFL3hNvz
>>>>
>>>> It seems that an extra node, 16777343 "localhost" has been added to the 
>>>> cluster after storage1 was STONITHed (must be the localhost interface on 
>>>> storage1). Is there any way to prevent this?
>>>>
>>>> Does this help to determine why corosync is dying, and what I can do to 
>>>> fix it?
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>> ----- Original Message -----
>>>>
>>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>>> To: disc...@corosync.org
>>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>>
>>>>
>>>> Hello,
>>>>
>>>> I recently configured a 3-node fileserver cluster by building Corosync 
>>>> 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 
>>>> 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" nodes 
>>>> where the resources run (a DRBD disk, filesystem mount, and samba/nfs 
>>>> daemons), while the third node (storagequorum) is in standby mode and acts 
>>>> as a quorum node for the cluster. Today I discovered that corosync died on 
>>>> both storage0 and storage1 at the same time. Since corosync died, 
>>>> pacemaker shut down as well on both nodes. Because the cluster no longer 
>>>> had quorum (and no-quorum-policy is set to "freeze"), storagequorum was unable 
>>>> to STONITH either node and just left the resources frozen where they were 
>>>> running, on storage0. I cannot find any log information to determine why 
>>>> corosync crashed, and this is a disturbing problem as the cluster and its 
>>>> messaging layer must be stable. Below is my corosync configuration file as 
>>>> well as the corosync log file from each node during this period.
>>>>
>>>> corosync.conf:
>>>> http://pastebin.com/vWQDVmg8
>>>> Note that I have two redundant rings. On one of them, I specify the IP 
>>>> address (in this example 10.10.10.7) so that it binds to the correct 
>>>> interface (since potentially in the future those machines may have two 
>>>> interfaces on the same subnet).
>>>>
>>>> corosync.log from storage0:
>>>> http://pastebin.com/HK8KYDDQ
>>>>
>>>> corosync.log from storage1:
>>>> http://pastebin.com/sDWkcPUz
>>>>
>>>> corosync.log from storagequorum (the DC during this period):
>>>> http://pastebin.com/uENQ5fnf
>>>>
>>>> Issuing service corosync start && service pacemaker start on storage0 and 
>>>> storage1 resolved the problem and allowed the nodes to successfully 
>>>> reconnect to the cluster. What other information can I provide to help 
>>>> diagnose this problem and prevent it from recurring?
>>>>
>>>> Thanks,
>>>>
>>>> Andrew Martin
>>>>
>>>
>>>
>>
>>
>>
>>
>
>

_______________________________________________
discuss mailing list
disc...@corosync.org
http://lists.corosync.org/mailman/listinfo/discuss



_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
