Re: [Pacemaker] pacemaker node stuck offline
On 2013-03-22 03:39, pacema...@feystorm.net wrote:
> On 03/21/2013 11:15 AM, Andreas Kurz wrote:
>> On 2013-03-21 14:31, Patrick Hemmer wrote:
>>> I've got a 2-node cluster where it seems last night one of the
>>> nodes went offline, and I can't see any reason why.
>>>
>>> Attached are the logs from the 2 nodes (the relevant timeframe
>>> seems to be 2013-03-21 between 06:05 and 06:10).
>>>
>>> This is on Ubuntu 12.04
>>
>> Looks like your non-redundant cluster communication was interrupted
>> at around that time for whatever reason, and your cluster
>> split-brained.
>>
>> Does the DRBD replication use a different network connection? If
>> yes, why not use it for a redundant ring setup ... and you should
>> use STONITH.
>>
>> I also wonder why you have defined expected_votes='1' in your
>> cluster.conf.
>>
>> Regards,
>> Andreas
>
> But shouldn't it have recovered? The node shows as OFFLINE, even
> though it's clearly communicating with the rest of the cluster. What
> is the procedure for getting the node back online? Anything other
> than bouncing pacemaker?

Looks like the cluster is having trouble rejoining the two DCs after
the split-brain. Try stopping cman/Pacemaker on i-3307d96b and clearing
the /var/lib/heartbeat/crm directory there, so that it starts with an
empty configuration and receives the latest updates from i-a706d8ff.

> Unfortunately no to the different network connection for DRBD. These
> are 2 EC2 instances, so redundant connections aren't available.
> Though since it is EC2, I could set up a STONITH to whack the other
> instance. The only problem here would be a race condition. The EC2
> API for shutting down or rebooting an instance isn't instantaneous.
> Both nodes could end up sending the signal to reboot the other node.

Yeah, you would need to add a very generous start-timeout to the
monitor operation of the stonith primitive ... but it works ;-)

> As for expected_votes=1, it's because it's a two-node cluster. Though
> I apparently forgot to set the `two_node` attribute :-(

Those two parameters should not be needed for a cman/pacemaker
cluster; you can tell pacemaker to ignore loss of quorum.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
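
The two concrete actions suggested in the reply above (resetting the
local CIB on the stuck node, and telling Pacemaker to ignore quorum
loss) could be scripted roughly as follows. This is only a sketch under
stated assumptions: the init scripts are assumed to be named
"pacemaker" and "cman" and to be driven through service(8), and the
run wrapper and DRY_RUN switch are hypothetical helpers, not part of
any tool in the thread. With the default DRY_RUN=1 it prints the
commands instead of running them.

```shell
#!/bin/sh
# Sketch only: recovery steps suggested in the thread, not a tested procedure.
# Assumptions: service names "pacemaker" and "cman"; DRY_RUN and run() are
# hypothetical helpers that print commands unless DRY_RUN=0 is set.

CIB_DIR=/var/lib/heartbeat/crm
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Run on the node that shows OFFLINE (i-3307d96b in this thread).
reset_local_cib() {
    run service pacemaker stop      # stop Pacemaker before cman
    run service cman stop
    # Remove the stale local CIB files so the node resyncs from its peer.
    run find "$CIB_DIR" -type f -delete
    run service cman start          # start cman before Pacemaker
    run service pacemaker start
}

# Run once on a node that is online; the property is stored in the CIB.
ignore_quorum_loss() {
    run crm configure property no-quorum-policy=ignore
}

reset_local_cib
```

Taking a copy of the CIB directory before deleting anything would be a
sensible precaution. Ignoring quorum loss without working STONITH
leaves the split-brain risk discussed in the thread in place.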
Re: [Pacemaker] pacemaker node stuck offline
On 22.03.2013 at 03:39, pacema...@feystorm.net wrote:
>> Looks like your non-redundant cluster communication was interrupted
>> at around that time for whatever reason, and your cluster
>> split-brained.
>>
>> Does the DRBD replication use a different network connection? If
>> yes, why not use it for a redundant ring setup ... and you should
>> use STONITH.
>>
>> I also wonder why you have defined expected_votes='1' in your
>> cluster.conf.
>
> But shouldn't it have recovered? The node shows as OFFLINE, even
> though it's clearly communicating with the rest of the cluster. What
> is the procedure for getting the node back online? Anything other
> than bouncing pacemaker?

crm node online nodeX.localdomain ?

-- 
LF
Re: [Pacemaker] pacemaker node stuck offline
That's only for use when the node is in standby. It's offline, not in
standby. Though for gits and shiggles I tried it anyway. Nothing.

-Patrick

On 2013/22/03 06:07, Leon Fauster wrote:
> On 22.03.2013 at 03:39, pacema...@feystorm.net wrote:
>>> Looks like your non-redundant cluster communication was
>>> interrupted at around that time for whatever reason, and your
>>> cluster split-brained.
>>>
>>> Does the DRBD replication use a different network connection? If
>>> yes, why not use it for a redundant ring setup ... and you should
>>> use STONITH.
>>>
>>> I also wonder why you have defined expected_votes='1' in your
>>> cluster.conf.
>>
>> But shouldn't it have recovered? The node shows as OFFLINE, even
>> though it's clearly communicating with the rest of the cluster.
>> What is the procedure for getting the node back online? Anything
>> other than bouncing pacemaker?
>
> crm node online nodeX.localdomain ?
>
> -- 
> LF
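
Since the distinction above is between a node listed as OFFLINE and one
in standby, a quick check of how "crm status" lists a node can be
sketched like this. node_state is a hypothetical helper, not a
crmsh command, and it only understands the "Online: [ ... ]" and
"OFFLINE: [ ... ]" lines quoted in this thread. Any other state,
standby included, is reported as unknown because the thread does not
show how those lines look.

```shell
#!/bin/sh
# Sketch: report whether a node appears in the Online or OFFLINE list of
# "crm status"-style text read from stdin. node_state is a hypothetical helper.

node_state() {
    # $1 = node name to look for
    awk -v node="$1" '
        /^[ \t]*Online: \[/  { for (i = 3; i < NF; i++) if ($i == node) state = "online" }
        /^[ \t]*OFFLINE: \[/ { for (i = 3; i < NF; i++) if ($i == node) state = "offline" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Sample input copied from the crm status output posted in this thread.
node_state i-3307d96b <<'EOF'
Online: [ i-a706d8ff ]
OFFLINE: [ i-3307d96b ]
EOF
# prints: offline
```

On a live cluster the same helper would be fed from the real command,
for example: crm status | node_state i-3307d96b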
Re: [Pacemaker] pacemaker node stuck offline
On 2013-03-21 14:31, Patrick Hemmer wrote:
> I've got a 2-node cluster where it seems last night one of the nodes
> went offline, and I can't see any reason why.
>
> Attached are the logs from the 2 nodes (the relevant timeframe seems
> to be 2013-03-21 between 06:05 and 06:10).
>
> This is on Ubuntu 12.04

Looks like your non-redundant cluster communication was interrupted at
around that time for whatever reason, and your cluster split-brained.

Does the DRBD replication use a different network connection? If yes,
why not use it for a redundant ring setup ... and you should use
STONITH.

I also wonder why you have defined expected_votes='1' in your
cluster.conf.

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

> # crm status
> Last updated: Thu Mar 21 13:17:21 2013
> Last change: Thu Mar 14 14:42:18 2013 via crm_shadow on i-a706d8ff
> Stack: cman
> Current DC: i-a706d8ff - partition WITHOUT quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 2 Nodes configured, unknown expected votes
> 5 Resources configured.
>
> Online: [ i-a706d8ff ]
> OFFLINE: [ i-3307d96b ]
>
>  dns-postgresql (ocf::cloud:route53): Started i-a706d8ff
>  Master/Slave Set: ms-drbd-postgresql [drbd-postgresql]
>      Masters: [ i-a706d8ff ]
>      Stopped: [ drbd-postgresql:0 ]
>  fs-drbd-postgresql (ocf::heartbeat:Filesystem): Started i-a706d8ff
>  postgresql (ocf::heartbeat:pgsql): Started i-a706d8ff
>
> # cman_tool nodes
> Node       Sts   Inc   Joined               Name
> 181480898   M      4   2013-03-14 14:25:27  i-3307d96b
> 181481642   M   5132   2013-03-21 06:07:40  i-a706d8ff
>
> # cman_tool status
> Version: 6.2.0
> Config Version: 1
> Cluster Name: cloudapp-servic
> Cluster Id: 63629
> Cluster Member: Yes
> Cluster Generation: 5132
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 1
> Total votes: 2
> Node votes: 1
> Quorum: 2
> Active subsystems: 4
> Flags:
> Ports Bound: 0
> Node name: i-3307d96b
> Node ID: 181480898
> Multicast addresses: 255.255.255.255
> Node addresses: 10.209.45.194
>
> # cat /etc/cluster/cluster.conf
> <?xml version='1.0' ?>
> <cluster name='cloudapp-servic' config_version='1'>
>   <logging to_logfile='no' syslog_facility='local2' syslog_priority='debug' />
>   <cman expected_votes='1' transport='udpu' />
>   <clusternodes>
>     <clusternode nodeid='181480898' name='i-3307d96b'>
>       <fence>
>         <method name='pcmk-redirect'>
>           <device name='pcmk' port='i-3307d96b' />
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode nodeid='181481642' name='i-a706d8ff'>
>       <fence>
>         <method name='pcmk-redirect'>
>           <device name='pcmk' port='i-a706d8ff' />
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <fencedevices>
>     <fencedevice name='pcmk' agent='fence_pcmk' />
>   </fencedevices>
> </cluster>
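
The cman_tool status output above shows "Expected votes: 1" next to
"Quorum: 2", which may look contradictory. A rough sketch of the
arithmetic, assuming cman takes quorum as the larger of
(expected votes + 2) / 2 and (total votes + 2) / 2 with integer
division, and forces it to 1 when two_node is set in a cluster of at
most two nodes. That formula is an assumption about cman's behaviour
and should be checked against the installed version; cman_quorum is a
hypothetical helper, and it assumes one vote per node, as in the
thread.

```shell
#!/bin/sh
# Sketch of the assumed cman quorum arithmetic; not taken from the thread.
# Usage: cman_quorum EXPECTED_VOTES TOTAL_VOTES TWO_NODE

cman_quorum() {
    expected=$1 total=$2 two_node=$3
    q1=$(( (expected + 2) / 2 ))    # quorum implied by expected_votes
    q2=$(( (total + 2) / 2 ))       # quorum implied by the votes actually present
    if [ "$q1" -gt "$q2" ]; then quorum=$q1; else quorum=$q2; fi
    # two_node lets a single vote stay quorate in a cluster of at most two
    # nodes (one vote per node is assumed, so total votes stand in for nodes).
    if [ "$two_node" = "1" ] && [ "$total" -le 2 ]; then quorum=1; fi
    echo "$quorum"
}

cman_quorum 1 2 0   # values from the output above, two_node unset: prints 2
cman_quorum 1 2 1   # same votes with two_node='1': prints 1
```

Under that assumption, expected_votes='1' on its own does not lower the
quorum once both nodes have joined, which is consistent with the
"Quorum: 2" line in the posted output.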
Re: [Pacemaker] pacemaker node stuck offline
On 03/21/2013 11:15 AM, Andreas Kurz wrote:
> On 2013-03-21 14:31, Patrick Hemmer wrote:
>> I've got a 2-node cluster where it seems last night one of the
>> nodes went offline, and I can't see any reason why.
>>
>> Attached are the logs from the 2 nodes (the relevant timeframe
>> seems to be 2013-03-21 between 06:05 and 06:10).
>>
>> This is on Ubuntu 12.04
>
> Looks like your non-redundant cluster communication was interrupted
> at around that time for whatever reason, and your cluster
> split-brained.
>
> Does the DRBD replication use a different network connection? If
> yes, why not use it for a redundant ring setup ... and you should
> use STONITH.
>
> I also wonder why you have defined expected_votes='1' in your
> cluster.conf.
>
> Regards,
> Andreas

But shouldn't it have recovered? The node shows as OFFLINE, even though
it's clearly communicating with the rest of the cluster. What is the
procedure for getting the node back online? Anything other than
bouncing pacemaker?

Unfortunately no to the different network connection for DRBD. These
are 2 EC2 instances, so redundant connections aren't available. Though
since it is EC2, I could set up a STONITH to whack the other instance.
The only problem here would be a race condition. The EC2 API for
shutting down or rebooting an instance isn't instantaneous. Both nodes
could end up sending the signal to reboot the other node.

As for expected_votes=1, it's because it's a two-node cluster. Though I
apparently forgot to set the `two_node` attribute :-(

-Patrick