池田淳子 wrote:
> Hi all,
> 
> I'm a newbie, and trying to understand how or when "Split-Brain" happens.
> As a trial, I am running a "Dummy" resource for now.

Split-brain occurs when there is a total communication failure (from the
heartbeat perspective) between at least two different cluster nodes.
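
In practice that means every communication path listed in ha.cf has to
fail at the same time.  As a rough sketch (the interface names, peer
address and serial device here are only placeholders, not taken from
your configuration), more than one path looks like this:

    # ha.cf -- one media directive per communication path
    bcast  eth2                # dedicated interconnect LAN
    ucast  eth0 192.168.0.2    # backup path over another LAN
    serial /dev/ttyS0          # optional serial link

With something like that in place, a single "ifdown eth2" no longer
cuts the nodes off from each other.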

> Heartbeat version is 2.0.8,
> cib.xml, ha.cf and ha-log are attached.
> See below cases, please give some advice.
> 
> *** case 01 ***
> My cib.xml was created using hb_gui, so the start-delay was "1m".
> I ran Heartbeat on 2 nodes at first, and disconnected the interconnect LAN
> on the DC node to cause a "Split-Brain". (# ifdown eth2)
> After producing the "Split-Brain", I brought the LAN back up with "ifup eth2".
> If eth2 is restored after "Action Dummy01_monitor_10000 (x) confirmed" on
> the former standby node, it seems that everything works well.
> But if it is restored before the confirmation, something goes wrong.
> In the failing case, I couldn't generate the "Split-Brain" again.
> I found that pengine/tengine were running on both nodes, but one node kept
> trying to be DC and the Dummy resource wouldn't start on that node;
> additionally, the failcount was incremented on that strange node.
> 
> The log message is here...
> WARN: do_dc_join_finalize: join-3: We are still in a transition.  Delaying
> until the TE completes.
> 
> In this case, I couldn't shut down the Heartbeat process without the KILL command...

There are some known bugs regarding not being able to shut down.
They're fixed in the upcoming version.
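
On the failcount you saw incrementing: crm_failcount can read it back
and reset it once the underlying problem is understood.  Something like
the following should work (check the crm_failcount man page on 2.0.8
for the exact options; "node2" and "Dummy01" are just guesses based on
your test):

    # show the failcount for Dummy01 on the affected node
    crm_failcount -G -U node2 -r Dummy01
    # clear it again
    crm_failcount -D -U node2 -r Dummy01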

> *** case 02 ***
> I changed start-delay from 1m to 0s.
> The confirmation for the monitor operation happened immediately, so even
> though I did ifdown/ifup in a row, it didn't matter.
> 
> Q1;  My guess is that if the interconnect LAN goes down/up in a row and some
> operations aren't confirmed, one node would consider this situation an
> ERROR and update its failcount. This case might appear when "start-delay"
> is long (e.g., "1m"), and cause some strange "Split-Brain". 
> Is this relationship between "start-delay" and "Split-Brain" correct?

As Andrew pointed out, there is no connection.
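
For reference, start-delay only postpones the first run of the monitor
operation after the resource has started; it has no influence on
membership or the DC election.  In a 2.0.x CIB the operation hb_gui
generates looks roughly like this (treat it as a sketch -- the exact
attribute spelling can vary between DTD revisions):

    <op id="Dummy01_mon" name="monitor" interval="10s"
        timeout="20s" start_delay="0s"/>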

> *** case 03 ***
> In case 02, the interconnect LAN came back up after the "Split-Brain" was
> complete, which means each node had been elected DC.
> For the third test, I tried to down/up the interconnect LAN before the DC
> election had finished.
> As a result, I got the "Split-Brain" again, but one node stayed OFFLINE when
> the interconnect LAN came back up.
> 
> Q2; I know I had better set up two or more interconnect LANs just in
> case, but are there any preferred ways to avoid case 03?  e.g., tuning some
> parameters or something like that.

The nodes should go online again after rejoining.  What version is this
with?  Can you supply logs for what happened in this case?
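
When you retest, a one-shot crm_mon run on each node is a quick way to
see whether both sides report each other ONLINE again once the link is
back (assuming the crm_mon shipped with your 2.0.x packages):

    crm_mon -1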

By the way, doing ifup/ifdown of interfaces you're using for heartbeat
connections has been known to make heartbeat sick.  Plugging and
unplugging the connectors, or adding/removing firewall rules, does not do
this.
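
If you want to keep simulating the failure from the command line, a
firewall rule that drops the heartbeat traffic is a safer equivalent of
pulling the cable; for example (eth2 as in your test, 694 being the
default udpport):

    # block heartbeat traffic on the interconnect
    iptables -A INPUT  -i eth2 -p udp --dport 694 -j DROP
    iptables -A OUTPUT -o eth2 -p udp --dport 694 -j DROP
    # ...observe the split-brain, then restore communication:
    iptables -D INPUT  -i eth2 -p udp --dport 694 -j DROP
    iptables -D OUTPUT -o eth2 -p udp --dport 694 -j DROP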

-- 
    Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
_______________________________________________________
Linux-HA-Dev: [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
