On 2013-03-22 03:39, pacema...@feystorm.net wrote:
>
> On 03/21/2013 11:15 AM, Andreas Kurz wrote:
>> On 2013-03-21 14:31, Patrick Hemmer wrote:
>>> I've got a 2-node cluster where it seems last night one of the nodes
>>> went offline, and I can't see any reason why.
>>>
>>> Attached are the logs from the 2 nodes (the relevant timeframe seems to
>>> be 2013-03-21 between 06:05 and 06:10).
>>> This is on ubuntu 12.04
>>
>> Looks like your non-redundant cluster-communication was interrupted at
>> around that time for whatever reason and your cluster split-brained.
>>
>> Does the drbd-replication use a different network-connection? If yes,
>> why not using it for a redundant ring setup ... and you should use
>> STONITH.
>>
>> I also wonder why you have defined "expected_votes='1'" in your
>> cluster.conf.
>>
>> Regards,
>> Andreas
>
> But shouldn't it have recovered? The node shows as "OFFLINE", even
> though it's clearly communicating with the rest of the cluster. What is
> the procedure for getting the node back online. Anything other than
> bouncing pacemaker?

Looks like the cluster is having trouble rejoining the two DCs after the
split-brain. Try to stop cman/Pacemaker on i-3307d96b and clear the
/var/lib/heartbeat/crm directory there, so it starts with an empty
configuration and receives the latest updates from i-a706d8ff.

>
> Unfortunately no to the different network connection for drbd. These are
> 2 EC2 instances, so redundant connections aren't available. Though since
> it is EC2, I could set up a STONITH to whack the other instance. The
> only problem here would be a race condition. The EC2 api for shutting
> down or rebooting an instance isn't instantaneous. Both nodes could end
> up sending the signal to reboot the other node.

Yeah, you would need to add a very generous start-timeout to the monitor
operation of the stonith primitive ... but it works ;-)

>
> As for expected_votes=1, it's because it's a two-node cluster. Though I
> apparently forgot to set the `two_node` attribute :-(

Those two parameters should not be needed for a cman/pacemaker cluster;
you can tell pacemaker to ignore loss of quorum.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now
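
For reference, the recovery steps and the quorum setting suggested above can be sketched as a small dry-run script that only prints the commands. This is a rough sketch under assumptions, not a tested procedure: it assumes the services are managed through the `service` wrapper under the names `pacemaker` and `cman`, that the crm shell is installed, and that removing the files under /var/lib/heartbeat/crm is what "clearing" the directory amounts to. The function names are made up for illustration. The first five commands would be run on the node that should discard its configuration (i-3307d96b here); `no-quorum-policy` is a cluster-wide property.

```python
import shlex

# Directory named in the reply above; holds the node's local copy of the CIB.
CIB_DIR = "/var/lib/heartbeat/crm"


def rejoin_commands(cib_dir=CIB_DIR):
    """Commands for the node that should drop its local configuration.

    Assumption: services are called 'pacemaker' and 'cman' and are driven
    through the 'service' wrapper. Pacemaker is stopped before cman and
    started after it.
    """
    return [
        ["service", "pacemaker", "stop"],
        ["service", "cman", "stop"],
        # Assumption: deleting the files is enough to start with an empty CIB.
        ["sh", "-c", "rm -f " + shlex.quote(cib_dir) + "/*"],
        ["service", "cman", "start"],
        ["service", "pacemaker", "start"],
    ]


def quorum_command():
    """crm shell command telling Pacemaker to ignore loss of quorum."""
    return ["crm", "configure", "property", "no-quorum-policy=ignore"]


def render(commands):
    """Turn argument lists into shell-quoted command lines for review."""
    return [" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands]


if __name__ == "__main__":
    # Dry run only: print the commands instead of executing them.
    for line in render(rejoin_commands() + [quorum_command()]):
        print(line)
```

Printing rather than executing is deliberate, so the commands can be checked against the actual init setup on the node before anything is stopped or deleted.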
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org