Florian (and all), thanks for the reply.

I've gone over past threads on the DRBD list as you suggested, and found only 
this:
http://archives.free.net.ph/message/20090909.131635.ef640f6a.en.html

I am not entirely certain what specific problem the 
one-separate-cluster-at-each-site design addresses that one-node-on-each-site 
does not.

From the above thread, the only roadblock explicitly mentioned was setting up 
cross-site multicast routing, which needs to be made to work. Fair enough.

I'd like to get a clear idea of what the roadblocks --actually are-- (not on a 
"The WAN link" level but what the WAN link -actually breaks-) to doing what I 
suggested.

Assuming I can get cross-site multicast to work, are there any other specific 
reasons the design wouldn't?

To recap, in my proposed solution, an outage will result in four things:
---
1. A "Race" by both nodes to a 3rd site, to perform an atomic operation (a 
mkdir for instance). Following it, it will be abundantly clear to both nodes 
"who is right, and who is dead".
---
2. A hard-iLO-poweroff STONITH (NOT reboot!) from the winner to the loser's 
iLO. The winner can also iptables-block all comms from the loser until further 
notice as an extra safety net.
---
3. A hard-own-iLO-poweroff-else-kernel-halt SMITH (NOT reboot!) suicide by the 
loser (SMITH is our pet acronym for Shoot-Myself-...).
---
4. A "WAN-PROBLEM=[true|false]" flag immediately raised (locally) by the winner 
based on pinging the OTHER SITE's ROUTER. A separate resource on the winner 
will, in the presence of this flag, monitor the same router of the other site 
for life, and when the other site comes back up (perhaps 
-and-stays-up-for-an-hour- or some similar flap-avoiding logic) issue a 
POWERON to the other node's iLO. That node will then come back up as a drbd 
slave, resync and get re-promoted to master.
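
To make the race in step 1 concrete, here is a minimal sketch. It assumes the 
3rd site exports a directory that both nodes can reach at some path, and that 
mkdir on that filesystem is atomic (true on a local POSIX filesystem; on a 
network filesystem it would have to be verified). The function name, the lock 
name and the node names are hypothetical, for illustration only.

```python
import os
import tempfile


def try_claim(arbiter_dir: str, node_name: str, lock_name: str = "outage.lock") -> bool:
    """Race for the lock directory; return True only for the node that wins.

    ASSUMPTION: arbiter_dir lives at the 3rd site and mkdir is atomic there.
    """
    lock_path = os.path.join(arbiter_dir, lock_name)
    try:
        # mkdir either creates the directory or fails because it already
        # exists, so at most one caller can succeed.
        os.mkdir(lock_path)
    except FileExistsError:
        return False  # the other node got there first: we are the loser
    # Record the winner's name so an operator can see who took over.
    with open(os.path.join(lock_path, "winner"), "w") as f:
        f.write(node_name + "\n")
    return True


if __name__ == "__main__":
    # Local demonstration with a temporary directory standing in for the 3rd site.
    with tempfile.TemporaryDirectory() as arbiter:
        print("node-a wins:", try_claim(arbiter, "node-a"))
        print("node-b wins:", try_claim(arbiter, "node-b"))
```

The winner would then proceed to step 2 and the loser to step 3. Clearing the 
lock after recovery is left out of the sketch.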

As an attractive side-benefit, this is a deathmatch-proof design.
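
The flap-avoiding logic mentioned in step 4 could look roughly like the sketch 
below. It only tracks how long the other site's router has been continuously 
reachable. The ping itself and the iLO POWERON call are left out, and the class 
name and the one-hour default are assumptions, not an existing tool.

```python
from typing import Optional


class StableUpDetector:
    """Report 'stable' only after an uninterrupted stretch of successful pings."""

    def __init__(self, required_up_seconds: float = 3600.0) -> None:
        self.required_up_seconds = required_up_seconds
        self.up_since: Optional[float] = None  # start of the current up-streak

    def observe(self, reachable: bool, now: float) -> bool:
        """Feed one ping result taken at time `now` (in seconds).

        Returns True once the router has been reachable for at least
        required_up_seconds without a single failed ping in between.
        """
        if not reachable:
            self.up_since = None  # any failure resets the streak
            return False
        if self.up_since is None:
            self.up_since = now
        return now - self.up_since >= self.required_up_seconds
```

The monitoring resource on the winner would call observe() after each ping and 
issue the POWERON only when it first returns True.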

----

NOTE: There's a departure from common wisdom here, and I am not sure whether 
this is one of the issues you're pointing at. 
Common wisdom states: SMITH BAD, not reliable (obvious reasons - no 
success/failure etc)

In this solution I claim: SMITH BAD, not reliable, except in one specific 
failure mode (WAN outage) where SMITH GOOD, is reliable, shortcomings can be 
worked around.

Both steps [2] and [3] are issued on EVERY TYPE of outage, regardless of 
whether it's WAN-related or not. 
In non-WAN issues the loser is considered compromised, thus making [3] 
unreliable, but [2] is reliable.
In WAN issues, the WAN is considered compromised, thus making [2] unreliable, 
but the node itself is sound, so [3] still is reliable.
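
The argument above can be written down as a small truth table. This is only a 
sketch of the reasoning, under the simplifying assumption that step [2] works 
exactly when the WAN is up and step [3] works exactly when the loser itself is 
sound. All names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FenceOutcome:
    stonith_reliable: bool  # step [2]: winner powers off the loser's iLO
    smith_reliable: bool    # step [3]: loser powers itself off

    @property
    def loser_fenced(self) -> bool:
        # Both steps are always issued, so one reliable step is enough.
        return self.stonith_reliable or self.smith_reliable


def expected_outcome(wan_ok: bool, loser_ok: bool) -> FenceOutcome:
    # ASSUMPTION: [2] needs the cross-site link; [3] needs a sound loser node.
    return FenceOutcome(stonith_reliable=wan_ok, smith_reliable=loser_ok)


if __name__ == "__main__":
    for wan_ok in (True, False):
        for loser_ok in (True, False):
            outcome = expected_outcome(wan_ok, loser_ok)
            print(f"wan_ok={wan_ok} loser_ok={loser_ok} -> fenced={outcome.loser_fenced}")
```

The only combination where the loser is not fenced is the one where the WAN and 
the loser fail at the same time, i.e. the double failure.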

To sum up, it looks to me like the "data safety" is provided by the layer 
underneath DRBD, not DRBD itself, and if it works as advertised, DRBD should 
have no problem, thus we have a system sufficiently reliable to withstand any 
scenario short of a double failure. 

... thoughts?
--

-----Original Message-----
From: Florian Haas [mailto:florian.h...@linbit.com] 
Sent: Monday, 18 January 2010 9:36 PM
To: pacemaker@oss.clusterlabs.org
Subject: Re: [Pacemaker] Split Site 2-way clusters

On 2010-01-18 11:14, Andrew Beekhof wrote:
> On Thu, Jan 14, 2010 at 11:44 PM, Miki Shapiro 
> <miki.shap...@coles.com.au> wrote:
>> Confused.
>>
>>
>>
>> I *am* running DRBD in dual-master mode
> 
> /me cringes... this sounds to me like an impossibly dangerous idea.
> Can someone from linbit comment on this please?  Am I imagining this?

Dual-Primary DRBD in a split site cluster? Really really bad idea.
Anyone attempting this, please search the drbd-user archives for multiple 
discussions about this in the past. Then reconsider.

Hope that makes it clear enough.
Florian

