Stallmann, Andreas wrote:
Hi there!

I have set up a two-node heartbeat cluster running
apache and drbd.
Everything went fine until we tested a "split brain"
scenario. In this case, when we detach both network
cables from one host, we get a two-primary situation.

I read in the thread "methods of dealing with network failover"
that setting up stonith and a quorum-node might be a good workaround.

Well... it isn't in our situation, I think.

Let's assume we have the following scenario:

- The two nodes, having two interfaces each, monitor each other via unicast queries over both interfaces.
- We do not have any dedicated crossover or serial connections,
because the servers reside in buildings a few kilometers apart.
- We have only the two Linux nodes in our network which are
part of our cluster (well, a few more to be honest, but those
are the two we may fiddle around with).
- We won't be able to set up a (dedicated) quorum server.
- We do not have a network-enabled power socket we could switch off
for the node we want to "shoot in the head".

Now someone stumbles over the network cables of, let's say, node-b, detaching it from the network.

Neither node-a nor node-b receives any unicast replies from its peer
anymore, but node-a can still ping its ping host, while node-b can't.

node-b should now assume that it is very likely dead.
node-a can't be sure: it can't reach its peer,
but it can still reach the rest of the network (or at least its ping
node).
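The reasoning above boils down to a small decision table. A minimal sketch, assuming each node runs two checks: `peer_reachable` (a unicast reply from the peer arrived on any interface) and `ping_host_reachable`. The names `assess` and `Assessment` are hypothetical; this is only an illustration, not something heartbeat provides.

```python
from enum import Enum


class Assessment(Enum):
    """What a node can conclude from its own reachability checks."""
    HEALTHY = "peer reachable"
    PEER_UNKNOWN = "peer unreachable, network reachable"
    SELF_ISOLATED = "peer and ping host unreachable"


def assess(peer_reachable: bool, ping_host_reachable: bool) -> Assessment:
    # If any interface still reaches the peer, there is no split brain to reason about.
    if peer_reachable:
        return Assessment.HEALTHY
    # Losing both the peer and the ping host points at our own links.
    if not ping_host_reachable:
        return Assessment.SELF_ISOLATED
    # The network looks fine from here, so the peer may be dead or merely cut off.
    return Assessment.PEER_UNKNOWN


if __name__ == "__main__":
    # The cable-pull scenario: node-a keeps its ping host, node-b loses everything.
    print("node-a:", assess(peer_reachable=False, ping_host_reachable=True).name)
    print("node-b:", assess(peer_reachable=False, ping_host_reachable=False).name)
```

Note that node-a's answer is PEER_UNKNOWN, not "peer dead": from node-a's side alone, a dead node-b and an isolated node-b look the same.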

Actually, I'd like to see the following happen:

- If a node is secondary and assumes that it is very likely dead, it should not be allowed to take over any resources.
- If a node is primary and isn't sure about its peer, it should
"freeze" its state at least until its peer is reachable over one
interface.
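The two rules in this wish list can be written down as a sketch, assuming the same two boolean checks as before (peer reachable on any interface, ping host reachable). `Role` and `decide` are hypothetical names. The sketch encodes only the two stated rules and explicitly reports the case they leave open.

```python
from enum import Enum


class Role(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


def decide(role: Role, peer_reachable: bool, ping_host_reachable: bool) -> str:
    """Apply the two rules from the wish list (hypothetical helper)."""
    if peer_reachable:
        # At least one interface still reaches the peer: no split brain.
        return "normal operation"
    if role is Role.PRIMARY:
        # Rule 2: the peer's state is unknown, so hold still until it answers again.
        return "freeze until peer is reachable"
    if not ping_host_reachable:
        # Rule 1: a secondary that lost both peer and ping host is likely the isolated one.
        return "refuse takeover"
    # A secondary that still reaches its ping host: the two rules do not cover this;
    # a plain failover would take over here.
    return "not covered by the two rules"


if __name__ == "__main__":
    for role, peer, ping in [
        (Role.SECONDARY, False, False),
        (Role.PRIMARY, False, True),
        (Role.SECONDARY, False, True),
    ]:
        print(role.name, peer, ping, "->", decide(role, peer, ping))
```

The last case is where split-brain risk remains: if the healthy secondary takes over while the isolated primary merely "freezes", both may still hold the primary role.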

The behaviour you describe is pretty much exactly what dopd (the DRBD peer outdater daemon) is for.

Look into http://blogs.linbit.com/florian/2007/10/01/an-underrated-cluster-admins-companion-dopd/
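Roughly, the idea behind dopd: when the DRBD replication link breaks while heartbeat can still talk to the other node, the primary has the peer's disk marked Outdated, and DRBD will not promote an Outdated disk to Primary. The toy model below illustrates only that idea. `Node`, `outdate_peer` and `cluster_links_up` are made-up names, this is not dopd's or DRBD's code, and what the real daemon does when the peer is unreachable on every path is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str = "Secondary"          # "Primary" or "Secondary"
    disk: str = "UpToDate"           # mirrors DRBD disk-state names UpToDate / Outdated
    cluster_links_up: bool = True    # can heartbeat still reach this node on some path?

    def promote(self) -> bool:
        # An Outdated disk must not become Primary.
        if self.disk == "Outdated":
            return False
        self.role = "Primary"
        return True


def outdate_peer(primary: Node, peer: Node) -> bool:
    """Toy version of the outdate request a primary sends when replication breaks.

    Returns True if the peer's disk was marked Outdated. If the two nodes cannot
    talk over any cluster link, nothing happens here (real behaviour not modeled).
    """
    if not (primary.cluster_links_up and peer.cluster_links_up):
        return False
    peer.disk = "Outdated"
    return True


if __name__ == "__main__":
    # Replication link broken, heartbeat links still up: the peer gets outdated.
    a = Node("node-a", role="Primary")
    b = Node("node-b")
    print(outdate_peer(a, b), b.disk, b.promote())  # True Outdated False
```

The point of the design is that the decision "may this node become primary?" is recorded in the peer's own disk state while the nodes can still communicate, instead of being guessed later from reachability alone.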

Dopd was rather unusable for the last few weeks (months?), but I read that it recently received a bunch of fixes and is supposed to work now (according to the drbd-user mailing list and Lars Ellenberg, one of the main authors of DRBD).

Refer to that mailing-list and the blog entry. I'd be glad if you told us how this worked out.

Regards
Dominik
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems