Corosync died another 3 times during the night on storage1. I wrote a daemon 
that attempts to restart it as soon as it fails, so only one of those failures 
resulted in a STONITH of storage1.
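The daemon is nothing elaborate; a minimal sketch of the idea (illustrative 
only -- the 5-second interval and the "service corosync status"/"service ... 
start" commands are assumptions about how the daemons are managed on these 
Ubuntu 12.04 nodes) looks like this:

    #!/usr/bin/env python
    # Minimal corosync watchdog sketch: poll the service and restart it
    # (and pacemaker) as soon as it is found dead.
    import subprocess
    import time

    CHECK_INTERVAL = 5  # seconds between health checks (assumed value)

    def corosync_running():
        # LSB-style init scripts return 0 from "status" when the daemon is up
        return subprocess.call(["service", "corosync", "status"]) == 0

    while True:
        if not corosync_running():
            subprocess.call(["service", "corosync", "start"])
            subprocess.call(["service", "pacemaker", "start"])
        time.sleep(CHECK_INTERVAL)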

I enabled debug in the corosync config, so I was able to capture debug output 
from a period when corosync died:
http://pastebin.com/eAmJSmsQ
In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For 
reference, here is my Pacemaker configuration:
http://pastebin.com/DFL3hNvz
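(By "enabled debug" above I mean the debug flag in the logging section of 
corosync.conf; a generic excerpt of what that stanza looks like follows -- the 
logfile path here is only illustrative, and the real file is in the pastebin 
from my original message:)

    logging {
            to_logfile: yes
            logfile: /var/log/corosync/corosync.log
            to_syslog: yes
            timestamp: on
            debug: on
    }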

It seems that an extra node, 16777343 "localhost", was added to the cluster 
after storage1 was STONITHed (presumably the loopback interface on storage1). 
Is there any way to prevent this?
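(As a sanity check on the loopback theory: assuming the auto-generated nodeid 
is just the ring0 IPv4 address packed into a 32-bit integer, 16777343 decodes 
back to 127.0.0.1. A quick check, written by hand rather than taken from any 
corosync tool:)

    #!/usr/bin/env python
    # Decode a corosync-style auto-generated nodeid back into dotted-quad
    # form, assuming the nodeid is the IPv4 address read as a 32-bit
    # little-endian integer (as it would be on these amd64 hosts).
    import socket
    import struct

    def nodeid_to_ip(nodeid):
        return socket.inet_ntoa(struct.pack("<I", nodeid))

    print(nodeid_to_ip(16777343))  # prints 127.0.0.1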

Does this help to determine why corosync is dying, and what I can do to fix it?

Thanks,

Andrew

----- Original Message -----

From: "Andrew Martin" <amar...@xes-inc.com>
To: disc...@corosync.org
Sent: Thursday, November 1, 2012 12:11:35 AM
Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster


Hello,

I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 
and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 
amd64. Two of the nodes (storage0 and storage1) are "real" nodes where the 
resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while the 
third node (storagequorum) is in standby mode and acts as a quorum node for the 
cluster. Today I discovered that corosync died on both storage0 and storage1 at 
the same time. Since corosync died, Pacemaker shut down as well on both nodes. 
Because the cluster no longer had quorum (and no-quorum-policy is set to 
"freeze"), storagequorum was unable to STONITH either node and simply left the 
resources frozen where they were running, on storage0. I cannot find any log 
information to determine why corosync crashed, which is a disturbing problem 
because the cluster and its messaging layer must be stable. Below is my corosync 
configuration file as well as the corosync log file from each node during this 
period.

corosync.conf:
http://pastebin.com/vWQDVmg8
Note that I have two redundant rings. On one of them I specify the full host IP 
address (in this example 10.10.10.7) so that corosync binds to the correct 
interface (since those machines may eventually have two interfaces on the same 
subnet).
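(Schematically, the interface sections look like the excerpt below; 10.10.10.7 
is real, the other values are placeholders, and everything else -- rrp_mode, 
transport, ports -- is omitted here but present in the pastebin:)

    totem {
            version: 2
            interface {
                    ringnumber: 0
                    # full host IP rather than a network address, so that
                    # ring0 is pinned to this specific interface
                    bindnetaddr: 10.10.10.7
            }
            interface {
                    ringnumber: 1
                    # placeholder network address for the second ring
                    bindnetaddr: 10.10.11.0
            }
    }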

corosync.log from storage0:
http://pastebin.com/HK8KYDDQ

corosync.log from storage1:
http://pastebin.com/sDWkcPUz

corosync.log from storagequorum (the DC during this period):
http://pastebin.com/uENQ5fnf

Issuing "service corosync start && service pacemaker start" on storage0 and 
storage1 resolved the problem and allowed the nodes to successfully reconnect 
to the cluster. What other information can I provide to help diagnose this 
problem and prevent it from recurring?

Thanks,

Andrew Martin

