Hi Angus,
Corosync died again while using libqb 0.14.3. Here is the coredump from today:
http://sources.xes-inc.com/downloads/corosync.nov2.coredump

# corosync -f
notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to provide service.
info [MAIN ] Corosync built-in features: pie relro bindnow
Bus error (core dumped)

Here's the log:
http://pastebin.com/bUfiB3T3

Did your analysis of the core dump reveal anything?

Is there a way for me to make it generate fdata with a bus error, or how else can I gather additional information to help debug this?

Thanks,

Andrew

----- Original Message -----

From: "Angus Salkeld" <asalk...@redhat.com>
To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
Sent: Thursday, November 1, 2012 5:47:16 PM
Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in cluster

On 01/11/12 17:27 -0500, Andrew Martin wrote:
>Hi Angus,
>
>I'll try upgrading to the latest libqb tomorrow and see if I can reproduce
>this behavior with it. I was able to get a coredump by running corosync
>manually in the foreground (corosync -f):
>http://sources.xes-inc.com/downloads/corosync.coredump

Thanks, looking...

>There still isn't anything added to /var/lib/corosync however. What do I need
>to do to enable the fdata file to be created?

Well if it crashes with SIGSEGV it will generate it automatically.
(I see you are getting a bus error) - :(.

-A

>Thanks,
>
>Andrew
>
>----- Original Message -----
>
>From: "Angus Salkeld" <asalk...@redhat.com>
>To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
>Sent: Thursday, November 1, 2012 5:11:23 PM
>Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in
>cluster
>
>On 01/11/12 14:32 -0500, Andrew Martin wrote:
>>Hi Honza,
>>
>>Thanks for the help. I enabled core dumps in /etc/security/limits.conf but
>>didn't have a chance to reboot and apply the changes so I don't have a core
>>dump this time.
>>Do core dumps need to be enabled for the fdata-DATETIME-PID
>>file to be generated? Right now all that is in /var/lib/corosync are the
>>ringid_XXX files. Do I need to set something explicitly in the corosync
>>config to enable this logging?
>>
>>I did find something else interesting with libqb this time. I compiled
>>libqb 0.14.2 for use with the cluster. This time when corosync died I noticed
>>the following in dmesg:
>>Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in libqb.so.0.14.2[7f657a525000+1f000]
>>This error was present for only one of the many times corosync has died.
>>
>>I see that there is a newer version of libqb (0.14.3) out, but didn't see a
>>fix for this particular bug. Could this libqb problem be related to
>>corosync hanging up? Here's the corresponding corosync log file (next time I
>>should have a core dump as well):
>>http://pastebin.com/5FLKg7We
>
>Hi Andrew
>
>I can't see much wrong with the log either. If you could run with the latest
>(libqb-0.14.3) and post a backtrace if it still happens, that would be great.
>
>Thanks
>Angus
>
>>Thanks,
>>
>>Andrew
>>
>>----- Original Message -----
>>
>>From: "Jan Friesse" <jfrie...@redhat.com>
>>To: "Andrew Martin" <amar...@xes-inc.com>
>>Cc: disc...@corosync.org, "The Pacemaker cluster resource manager"
>><pacemaker@oss.clusterlabs.org>
>>Sent: Thursday, November 1, 2012 7:55:52 AM
>>Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>
>>Andrew,
>>I was not able to find anything interesting (from corosync point of
>>view) in configuration/logs (corosync related).
>>
>>What would be helpful:
>>- if corosync died, there should be
>>/var/lib/corosync/fdata-DATETIME-PID of dead corosync. Can you please
>>xz them and store somewhere (they are quite large but well compressible).
>>- If you are able to reproduce the problem (which it seems you are), can
>>you please enable generation of coredumps and store a backtrace of the
>>coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID;
>>to obtain a backtrace, run gdb corosync /var/lib/corosync/core.pid and
>>then thread apply all bt.) If you are running a distribution with ABRT
>>support, you can also use ABRT to generate a report.
>>
>>Regards,
>>Honza
>>
>>Andrew Martin wrote:
>>> Corosync died an additional 3 times during the night on storage1. I wrote a
>>> daemon to attempt to start it as soon as it fails, so only one of those
>>> times resulted in a STONITH of storage1.
>>>
>>> I enabled debug in the corosync config, so I was able to capture a period
>>> when corosync died with debug output:
>>> http://pastebin.com/eAmJSmsQ
>>> In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For
>>> reference, here is my Pacemaker configuration:
>>> http://pastebin.com/DFL3hNvz
>>>
>>> It seems that an extra node, 16777343 "localhost" has been added to the
>>> cluster after storage1 was STONITHed (must be the localhost interface on
>>> storage1). Is there any way to prevent this?
>>>
>>> Does this help to determine why corosync is dying, and what I can do to fix
>>> it?
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ----- Original Message -----
>>>
>>> From: "Andrew Martin" <amar...@xes-inc.com>
>>> To: disc...@corosync.org
>>> Sent: Thursday, November 1, 2012 12:11:35 AM
>>> Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster
>>>
>>> Hello,
>>>
>>> I recently configured a 3-node fileserver cluster by building Corosync
>>> 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu
>>> 12.04 amd64. Two of the nodes (storage0 and storage1) are "real" nodes
>>> where the resources run (a DRBD disk, filesystem mount, and samba/nfs
>>> daemons), while the third node (storagequorum) is in standby mode and acts
>>> as a quorum node for the cluster.
>>> Today I discovered that corosync died on
>>> both storage0 and storage1 at the same time. Since corosync died, pacemaker
>>> shut down as well on both nodes. Because the cluster no longer had quorum
>>> (and the no-quorum-policy="freeze"), storagequorum was unable to STONITH
>>> either node and just left the resources frozen where they were running, on
>>> storage0. I cannot find any log information to determine why corosync
>>> crashed, and this is a disturbing problem as the cluster and its messaging
>>> layer must be stable. Below is my corosync configuration file as well as
>>> the corosync log file from each node during this period.
>>>
>>> corosync.conf:
>>> http://pastebin.com/vWQDVmg8
>>> Note that I have two redundant rings. On one of them, I specify the IP
>>> address (in this example 10.10.10.7) so that it binds to the correct
>>> interface (since potentially in the future those machines may have two
>>> interfaces on the same subnet).
>>>
>>> corosync.log from storage0:
>>> http://pastebin.com/HK8KYDDQ
>>>
>>> corosync.log from storage1:
>>> http://pastebin.com/sDWkcPUz
>>>
>>> corosync.log from storagequorum (the DC during this period):
>>> http://pastebin.com/uENQ5fnf
>>>
>>> Issuing service corosync start && service pacemaker start on storage0 and
>>> storage1 resolved the problem and allowed the nodes to successfully
>>> reconnect to the cluster. What other information can I provide to help
>>> diagnose this problem and prevent it from recurring?
>>>
>>> Thanks,
>>>
>>> Andrew Martin
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> disc...@corosync.org
>>> http://lists.corosync.org/mailman/listinfo/discuss
>>
>>_______________________________________________
>>Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>Project Home: http://www.clusterlabs.org
>>Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>Bugs: http://bugs.clusterlabs.org
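
For the backtrace Honza asked for earlier in the thread, the interactive gdb session can also be run non-interactively. The following is only a minimal sketch and was not posted by anyone in the thread: the helper name gdb_backtrace_command is hypothetical, it assumes gdb is installed and relies on gdb's -batch and -ex options, and the core path keeps the core.PID placeholder from the thread, which has to be replaced with the real file name. It builds and prints the command rather than running it.

```python
import shlex


def gdb_backtrace_command(binary, core_file):
    """Build a gdb command line that prints a backtrace of every thread."""
    return [
        "gdb",
        "-batch",                      # exit once the -ex commands have run
        "-ex", "thread apply all bt",  # the command suggested in the thread
        binary,
        core_file,
    ]


# core.PID is a placeholder, as in the thread; substitute the real core file.
cmd = gdb_backtrace_command("corosync", "/var/lib/corosync/core.PID")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run(cmd, capture_output=True, text=True) so the backtrace can be saved to a file and posted.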
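
On the extra node 16777343 "localhost" mentioned earlier in the thread, a short check supports the guess that it is the loopback interface. This sketch assumes that an automatically generated node ID is the IPv4 address of the bound interface read as a 32-bit integer in little-endian byte order (the amd64 host order); that is an assumption, not something confirmed in the thread. The function name nodeid_to_ipv4 is hypothetical.

```python
import ipaddress
import struct


def nodeid_to_ipv4(nodeid):
    """Interpret a 32-bit node ID as four IPv4 address bytes (little-endian)."""
    raw = struct.pack("<I", nodeid)  # little-endian unsigned 32-bit integer
    return str(ipaddress.IPv4Address(raw))


print(nodeid_to_ipv4(16777343))  # prints 127.0.0.1
```

Under that assumption, 16777343 is 0x0100007F, whose bytes in memory order are 7F 00 00 01, that is 127.0.0.1. This is consistent with corosync having bound to the loopback address on storage1 when it was restarted.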