Re: [Pacemaker] pacemaker node stuck offline

2013-03-25 Thread Andreas Kurz
On 2013-03-22 03:39, pacema...@feystorm.net wrote:
 
 On 03/21/2013 11:15 AM, Andreas Kurz wrote:
 On 2013-03-21 14:31, Patrick Hemmer wrote:
 I've got a 2-node cluster where it seems last night one of the nodes
 went offline, and I can't see any reason why.

 Attached are the logs from the 2 nodes (the relevant timeframe seems to
 be 2013-03-21 between 06:05 and 06:10).
 This is on ubuntu 12.04
 
 Looks like your non-redundant cluster-communication was interrupted at
 around that time for whatever reason and your cluster split-brained.
 
 Does the drbd-replication use a different network-connection? If yes,
 why not use it for a redundant ring setup ... and you should use
 STONITH.
 
 I also wonder why you have defined expected_votes='1' in your
 cluster.conf.
 
 Regards,
 Andreas
 But shouldn't it have recovered? The node shows as OFFLINE, even
 though it's clearly communicating with the rest of the cluster. What is
 the procedure for getting the node back online? Anything other than
 bouncing pacemaker?

Looks like the cluster is having trouble merging the two DCs after the
split-brain. Try stopping cman/Pacemaker on i-3307d96b and cleaning out
the /var/lib/heartbeat/crm directory there, so that it starts with an
empty configuration and receives the latest updates from i-a706d8ff.
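
As a rough sketch of that procedure, to be run on i-3307d96b only: the
script below just prints the steps unless DRY_RUN=0 is set. The init
script names (`service cman`, `service pacemaker`), the backup directory
and the function names are assumptions for illustration, so check them
against your installation first.

```shell
#!/bin/sh
# Sketch only: prints each step by default; set DRY_RUN=0 to execute as root.
# Assumptions: Ubuntu-style "service" scripts named cman and pacemaker, and a
# backup directory of our own choosing (CIB_BACKUP). The old CIB files are
# moved aside rather than deleted, so they can be inspected later.
DRY_RUN="${DRY_RUN:-1}"
CIB_DIR="/var/lib/heartbeat/crm"
CIB_BACKUP="${CIB_BACKUP:-/root/crm-backup}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Stop Pacemaker before cman, empty the CIB directory, then start cman
# before Pacemaker. The chain stops at the first step that fails.
reset_local_cib() {
    run service pacemaker stop &&
    run service cman stop &&
    run mkdir -p "$CIB_BACKUP" &&
    run sh -c "mv $CIB_DIR/* $CIB_BACKUP/" &&
    run service cman start &&
    run service pacemaker start
}

reset_local_cib
```

Once Pacemaker is back up, `crm status` on either node should show whether
i-3307d96b has rejoined.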

 
 Unfortunately no to the different network connection for drbd. These are
 2 EC2 instances, so redundant connections aren't available. Though since
 it is EC2, I could set up a STONITH to whack the other instance. The
 only problem here would be a race condition. The EC2 api for shutting
 down or rebooting an instance isn't instantaneous. Both nodes could end
 up sending the signal to reboot the other node.

Yeah, you would need to add a very generous start-timeout to the monitor
operation of the stonith primitive ... but it works ;-)

 
 As for expected_votes=1, it's because it's a two-node cluster. Though I
 apparently forgot to set the `two_node` attribute :-(

Those two parameters should not be needed for a cman/pacemaker cluster;
you can tell Pacemaker to ignore loss of quorum instead.
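
For reference, the cluster property in question is `no-quorum-policy`.
A minimal sketch, assuming the crm shell is installed; it only prints the
command unless APPLY=1 is set (the `APPLY` and `QUORUM_CMD` names are
just for this example):

```shell
#!/bin/sh
# Sketch: prints the command by default; set APPLY=1 to run it on a cluster node.
# Assumes the crm shell (crmsh) is installed and the caller is root.
QUORUM_CMD="crm configure property no-quorum-policy=ignore"
if [ "${APPLY:-0}" = "1" ]; then
    $QUORUM_CMD
else
    echo "$QUORUM_CMD"
fi
```

With quorum loss ignored, both halves of a split two-node cluster will keep
running resources, which is one more reason to get STONITH working.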

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker node stuck offline

2013-03-22 Thread Leon Fauster
Am 22.03.2013 um 03:39 schrieb pacema...@feystorm.net:
 
 
 Looks like your non-redundant cluster-communication was interrupted at
 around that time for whatever reason and your cluster split-brained.
 
 Does the drbd-replication use a different network-connection? If yes,
 why not use it for a redundant ring setup ... and you should use STONITH.
 
 I also wonder why you have defined expected_votes='1' in your
 cluster.conf.
 
 But shouldn't it have recovered? The node shows as OFFLINE, even
 though it's clearly communicating with the rest of the cluster. What is
 the procedure for getting the node back online? Anything other than
 bouncing pacemaker?


crm node online nodeX.localdomain ?

--
LF





Re: [Pacemaker] pacemaker node stuck offline

2013-03-22 Thread pacemaker
That's only for use when the node is in standby. This node is offline, not
in standby. Though for gits and shiggles I tried it anyway. Nothing.

-Patrick


On 2013/22/03 06:07, Leon Fauster wrote:
 Am 22.03.2013 um 03:39 schrieb pacema...@feystorm.net:
 Looks like your non-redundant cluster-communication was interrupted at
 around that time for whatever reason and your cluster split-brained.

 Does the drbd-replication use a different network-connection? If yes,
 why not use it for a redundant ring setup ... and you should use STONITH.

 I also wonder why you have defined expected_votes='1' in your
 cluster.conf.

 But shouldn't it have recovered? The node shows as OFFLINE, even
 though it's clearly communicating with the rest of the cluster. What is
 the procedure for getting the node back online? Anything other than
 bouncing pacemaker?

 crm node online nodeX.localdomain ?



Re: [Pacemaker] pacemaker node stuck offline

2013-03-21 Thread Andreas Kurz
On 2013-03-21 14:31, Patrick Hemmer wrote:
 I've got a 2-node cluster where it seems last night one of the nodes
 went offline, and I can't see any reason why.
 
 Attached are the logs from the 2 nodes (the relevant timeframe seems to
 be 2013-03-21 between 06:05 and 06:10).
 This is on ubuntu 12.04

Looks like your non-redundant cluster-communication was interrupted at
around that time for whatever reason and your cluster split-brained.

Does the drbd-replication use a different network-connection? If yes,
why not use it for a redundant ring setup ... and you should use STONITH.

I also wonder why you have defined expected_votes='1' in your
cluster.conf.

Regards,
Andreas



 
 # crm status
 
 Last updated: Thu Mar 21 13:17:21 2013
 Last change: Thu Mar 14 14:42:18 2013 via crm_shadow on i-a706d8ff
 Stack: cman
 Current DC: i-a706d8ff - partition WITHOUT quorum
 Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
 2 Nodes configured, unknown expected votes
 5 Resources configured.
 
 
 Online: [ i-a706d8ff ]
 OFFLINE: [ i-3307d96b ]
 
  dns-postgresql (ocf::cloud:route53): Started i-a706d8ff
  Master/Slave Set: ms-drbd-postgresql [drbd-postgresql]
      Masters: [ i-a706d8ff ]
      Stopped: [ drbd-postgresql:0 ]
  fs-drbd-postgresql (ocf::heartbeat:Filesystem): Started i-a706d8ff
  postgresql (ocf::heartbeat:pgsql): Started i-a706d8ff
 
 
 # cman_tool nodes
 Node        Sts   Inc   Joined                Name
 181480898    M      4   2013-03-14 14:25:27   i-3307d96b
 181481642    M   5132   2013-03-21 06:07:40   i-a706d8ff
 
 
 # cman_tool status
 Version: 6.2.0
 Config Version: 1
 Cluster Name: cloudapp-servic
 Cluster Id: 63629
 Cluster Member: Yes
 Cluster Generation: 5132
 Membership state: Cluster-Member
 Nodes: 2
 Expected votes: 1
 Total votes: 2
 Node votes: 1
 Quorum: 2 
 Active subsystems: 4
 Flags:
 Ports Bound: 0 
 Node name: i-3307d96b
 Node ID: 181480898
 Multicast addresses: 255.255.255.255
 Node addresses: 10.209.45.194
 
 
 
 # cat /etc/cluster/cluster.conf
 <?xml version="1.0" ?>
 <cluster name='cloudapp-servic' config_version='1'>
   <logging to_logfile='no' syslog_facility='local2'
            syslog_priority='debug' />
   <cman expected_votes='1' transport='udpu' />
   <clusternodes>
     <clusternode nodeid='181480898' name='i-3307d96b'>
       <fence>
         <method name='pcmk-redirect'>
           <device name='pcmk' port='i-3307d96b' />
         </method>
       </fence>
     </clusternode>
     <clusternode nodeid='181481642' name='i-a706d8ff'>
       <fence>
         <method name='pcmk-redirect'>
           <device name='pcmk' port='i-a706d8ff' />
         </method>
       </fence>
     </clusternode>
   </clusternodes>
 
   <fencedevices>
     <fencedevice name="pcmk" agent="fence_pcmk" />
   </fencedevices>
 </cluster>
 
 
 
 
 


Re: [Pacemaker] pacemaker node stuck offline

2013-03-21 Thread pacemaker


On 03/21/2013 11:15 AM, Andreas Kurz wrote:
 On 2013-03-21 14:31, Patrick Hemmer wrote:
 I've got a 2-node cluster where it seems last night one of the nodes
 went offline, and I can't see any reason why.

 Attached are the logs from the 2 nodes (the relevant timeframe seems to
 be 2013-03-21 between 06:05 and 06:10).
 This is on ubuntu 12.04

 Looks like your non-redundant cluster-communication was interrupted at
 around that time for whatever reason and your cluster split-brained.

 Does the drbd-replication use a different network-connection? If yes,
 why not use it for a redundant ring setup ... and you should use
 STONITH.

 I also wonder why you have defined expected_votes='1' in your
 cluster.conf.

 Regards,
 Andreas
But shouldn't it have recovered? The node shows as OFFLINE, even
though it's clearly communicating with the rest of the cluster. What is
the procedure for getting the node back online? Anything other than
bouncing pacemaker?

Unfortunately no to the different network connection for drbd. These are
2 EC2 instances, so redundant connections aren't available. Though since
it is EC2, I could set up a STONITH to whack the other instance. The
only problem here would be a race condition. The EC2 api for shutting
down or rebooting an instance isn't instantaneous. Both nodes could end
up sending the signal to reboot the other node.

As for expected_votes=1, it's because it's a two-node cluster. Though I
apparently forgot to set the `two_node` attribute :-(

-Patrick