Re: [Linux-ha-dev] "start-delay" parameter for monitor operation and "Split-Brain"

2007-04-12 Thread Andrew Beekhof

On 4/12/07, 池田淳子 <[EMAIL PROTECTED]> wrote:

> Hi all,
>
> I'm a newbie, and I'm trying to understand how and when "Split-Brain" happens.


It happens when two or more nodes can't communicate with each other.

There is no connection between this and an operation's "start-delay".


What you *might* be seeing is an old bug that was triggered when
start-delay > timeout.
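
For anyone who wants to check a configuration for the condition mentioned above, here is a minimal sketch. It is not part of Heartbeat. It assumes that operations are <op> elements carrying "timeout" and "start-delay" (or "start_delay") attributes, that durations use the suffixes ms, s, m or h, and that a bare number means seconds. The helper names and the sample XML are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed duration suffixes for this sketch; a bare number is treated as seconds.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def to_seconds(value, default=0.0):
    """Convert a duration such as "1m", "30s" or "10" to seconds."""
    if value is None:
        return default
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(ms|s|m|h)?\s*", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    return float(number) * _UNITS[unit or "s"]


def ops_with_long_start_delay(cib_xml):
    """Return the ids of <op> elements whose start-delay exceeds their timeout."""
    root = ET.fromstring(cib_xml)
    flagged = []
    for op in root.iter("op"):
        timeout = op.get("timeout")
        if timeout is None:
            continue  # no explicit timeout to compare against
        # The attribute spelling is an assumption, so both forms are accepted.
        delay = op.get("start-delay") or op.get("start_delay")
        if to_seconds(delay) > to_seconds(timeout):
            flagged.append(op.get("id"))
    return flagged


# Hypothetical fragment, not taken from the attached cib.xml.
SAMPLE = """
<operations>
  <op id="Dummy01_monitor" name="monitor" interval="10s" timeout="30s" start-delay="1m"/>
  <op id="Dummy01_start" name="start" timeout="60s" start-delay="0s"/>
</operations>
"""

print(ops_with_long_start_delay(SAMPLE))
```

With the sample above, only the monitor operation is flagged, because its 1m delay is longer than its 30s timeout.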


> As a trial, I am running a "Dummy" resource for now.
> Heartbeat version is 2.0.8;
> cib.xml, ha.cf and ha-log are attached.
> Please see the cases below and give me some advice.

> *** case 01 ***
> My cib.xml was created using hb_gui, so start-delay was "1m".
> I ran Heartbeat on 2 nodes at first, and disconnected the interconnect LAN
> on the DC node to cause "Split-Brain". (# ifdown eth2)
> After causing "Split-Brain", I brought the LAN back up with "ifup eth2".
> If eth2 is restored after "Action Dummy01_monitor_1 (x) confirmed" on
> the former stand-by node, everything seems to work well.
> But if it's restored before the confirmation, something goes wrong.
> In the failing case, I couldn't generate "Split-Brain" again.
> I found that pengine/tengine ran on both nodes, but one node kept trying to
> become DC and the Dummy resource wouldn't start on that node. Additionally,
> the failcount was incremented on that misbehaving node.
>
> The log message is here...
> WARN: do_dc_join_finalize: join-3: We are still in a transition.  Delaying
> until the TE completes.
>
> In this case, I couldn't shut down the Heartbeat process without the KILL command...

> *** case 02 ***
> I changed start-delay from 1m to 0s.
> The monitor operation was confirmed immediately, so even when I ran
> ifdown/ifup in a row, it didn't matter.
>
> Q1: My guess is that if the interconnect LAN goes down and up in a row while
> some operations aren't yet confirmed, one node treats this situation as an
> ERROR and so updates its failcount. This case might appear when "start-delay"
> is long (e.g. "1m"), and cause some strange "Split-Brain".
> Is this relationship between "start-delay" and "Split-Brain" correct?

> *** case 03 ***
> In case 02, the interconnect LAN came up after "Split-Brain" was complete,
> meaning each node had been elected DC.
> For the third test, I tried to bring the interconnect LAN down and up before
> the DC election finished.
> As a result, I got "Split-Brain" again, but one node stayed OFFLINE when
> the interconnect LAN was up.
>
> Q2: I know I had better set up two or more interconnect LANs just in
> case, but are there any preferred ways to avoid case 03? E.g., tuning some
> parameters or something like that.


> Best Regards,
> Junko Ikeda
>
> NTT DATA INTELLILINK CORPORATION
> Open Source Solutions Business Unit
> Open Source Business Division
>
> Toyosu Center Building Annex, 3-3-9, Toyosu,
> Koto-ku, Tokyo 135-0061, Japan
> TEL : +81-3-3534-4811
> FAX : +81-3-3534-4814
> mailto:[EMAIL PROTECTED]
> http://www.intellilink.co.jp/

___
Linux-HA-Dev: [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/






Re: [Linux-ha-dev] "start-delay" parameter for monitor operation and "Split-Brain"

2007-04-12 Thread Alan Robertson
池田淳子 wrote:
> Hi all,
> 
> I'm a newbie, and I'm trying to understand how and when "Split-Brain" happens.
> As a trial, I am running a "Dummy" resource for now.

Split-brain occurs when there is a total communication failure (from the
heartbeat perspective) between at least two different cluster nodes.

> Heartbeat version is 2.0.8;
> cib.xml, ha.cf and ha-log are attached.
> Please see the cases below and give me some advice.
> 
> *** case 01 ***
> My cib.xml was created using hb_gui, so start-delay was "1m".
> I ran Heartbeat on 2 nodes at first, and disconnected the interconnect LAN
> on the DC node to cause "Split-Brain". (# ifdown eth2)
> After causing "Split-Brain", I brought the LAN back up with "ifup eth2".
> If eth2 is restored after "Action Dummy01_monitor_1 (x) confirmed" on
> the former stand-by node, everything seems to work well.
> But if it's restored before the confirmation, something goes wrong.
> In the failing case, I couldn't generate "Split-Brain" again.
> I found that pengine/tengine ran on both nodes, but one node kept trying to
> become DC and the Dummy resource wouldn't start on that node. Additionally,
> the failcount was incremented on that misbehaving node.
> 
> The log message is here...
> WARN: do_dc_join_finalize: join-3: We are still in a transition.  Delaying
> until the TE completes.
> 
> In this case, I couldn't shut down the Heartbeat process without the KILL command...

There are some known bugs regarding not being able to shut down.
They're fixed in the upcoming version.

> *** case 02 ***
> I changed start-delay from 1m to 0s.
> The monitor operation was confirmed immediately, so even when I ran
> ifdown/ifup in a row, it didn't matter.
> 
> Q1: My guess is that if the interconnect LAN goes down and up in a row while
> some operations aren't yet confirmed, one node treats this situation as an
> ERROR and so updates its failcount. This case might appear when "start-delay"
> is long (e.g. "1m"), and cause some strange "Split-Brain".
> Is this relationship between "start-delay" and "Split-Brain" correct?

As Andrew pointed out, there is no connection.

> *** case 03 ***
> In case 02, the interconnect LAN came up after "Split-Brain" was complete,
> meaning each node had been elected DC.
> For the third test, I tried to bring the interconnect LAN down and up before
> the DC election finished.
> As a result, I got "Split-Brain" again, but one node stayed OFFLINE when
> the interconnect LAN was up.
> 
> Q2: I know I had better set up two or more interconnect LANs just in
> case, but are there any preferred ways to avoid case 03? E.g., tuning some
> parameters or something like that.

The nodes should go online again after rejoining.  What version is this
with?  Can you supply logs for what happened in this case?

By the way, doing ifup/ifdown on interfaces you're using for heartbeat
connections has been known to make heartbeat sick.  Plugging and
unplugging the connectors, or adding/removing firewall rules, does not do
this.

-- 
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship...  Let me
claim from you at all times your undisguised opinions." - William
Wilberforce


Re: [Linux-ha-dev] "start-delay" parameter for monitor operation and "Split-Brain"

2007-04-17 Thread Andrew Beekhof

On 4/13/07, 池田淳子 <[EMAIL PROTECTED]> wrote:

> Hi Andrew, Alan
>
> Thank you for your comments.
> This time I tried running the Dummy resource with these parameters.
>
> Heartbeat version is 2.0.8.
>
> case01;
> Heartbeat ran well on 2 nodes, and I brought down the interconnect LAN from
> the network switch (by shutting down the port).
> Split-Brain occurred. The stand-by node became a DC. I brought the LAN up
> immediately.
> If "Action Dummy01_monitor_1" wasn't confirmed on the former stand-by
> node when the LAN came up, that node would behave strangely at the
> next Split-Brain.


I'm lost... what are you talking about here?
I also don't see anything like this in the logs



> I can see the following message after I brought down the LAN again...
>
> WARN: do_dc_join_finalize: join-2: We are still in a transition.  Delaying
> until the TE completes.
>
> It seems that one node tries to join something, but it keeps failing.
> Is this the correct behavior of Heartbeat2?


It looks like a bug.
Any chance you could add "debug 1" to ha.cf and reproduce it?
Ideally with the latest development version, which should be released soonish.



> case02;
> I brought the interconnect LAN down and up again.
> In this case, the LAN trouble was recovered before the DC election on the
> stand-by node. After recovery, the former stand-by node keeps its status
> as OFFLINE.


Well, they didn't recover then, did they :-)


this is without doubt a CCM bug

if one greps for the following patterns:
#  grep -e 'cib.*ccm' -e 'cib.*mem_handle_event' split-brain2/case01/dl380g5a-ha-log

then you can clearly see the CCM "instance" going backwards from 3 to 2.

cib[17889]: 2007/04/13_11:24:49 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[17889]: 2007/04/13_11:24:49 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
cib[17889]: 2007/04/13_11:24:49 info: mem_handle_event: instance=3, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
cib[17889]: 2007/04/13_11:24:49 info: cib_ccm_msg_callback: LOST: dl380g5b
cib[17889]: 2007/04/13_11:24:49 info: cib_ccm_msg_callback: PEER: dl380g5a
cib[17889]: 2007/04/13_11:26:08 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[17889]: 2007/04/13_11:26:08 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
cib[17889]: 2007/04/13_11:26:08 info: mem_handle_event: instance=2, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
cib[17889]: 2007/04/13_11:26:08 info: cib_ccm_msg_callback: PEER: dl380g5b
cib[17889]: 2007/04/13_11:26:08 info: cib_ccm_msg_callback: PEER: dl380g5a

likewise in split-brain2/case02/dl380g5b-ha-log you can see the
progression 1->2->3->1->2
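
The same check can be automated. The following is only a rough sketch: it assumes each log entry sits on a single line in the ha-log file (this mail wraps them) and that the "instance=" field of mem_handle_event is the number that should never decrease. The function name is hypothetical.

```python
import re

# Matches the membership instance number reported by mem_handle_event.
INSTANCE_RE = re.compile(r"mem_handle_event: instance=(\d+),")


def instance_regressions(log_lines):
    """Return (previous, current) pairs where the instance number went backwards."""
    previous = None
    regressions = []
    for line in log_lines:
        match = INSTANCE_RE.search(line)
        if match is None:
            continue  # not a membership summary line
        current = int(match.group(1))
        if previous is not None and current < previous:
            regressions.append((previous, current))
        previous = current
    return regressions


# Lines copied from the excerpt above, unwrapped to one entry per line.
SAMPLE_LOG = [
    "cib[17889]: 2007/04/13_11:24:49 info: mem_handle_event: instance=3, "
    "nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3",
    "cib[17889]: 2007/04/13_11:24:49 info: cib_ccm_msg_callback: LOST: dl380g5b",
    "cib[17889]: 2007/04/13_11:26:08 info: mem_handle_event: instance=2, "
    "nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4",
]

print(instance_regressions(SAMPLE_LOG))
```

To run it against a real file, pass the open file object instead of SAMPLE_LOG, since iterating a file yields one line at a time.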


> Alan, see the attached logs for details.
>
> Best Regards,
> Junko Ikeda





Re: [Linux-ha-dev] "start-delay" parameter for monitor operation and "Split-Brain"

2007-04-19 Thread Andrew Beekhof

On 4/19/07, Junko IKEDA <[EMAIL PROTECTED]> wrote:

> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Beekhof
> > Sent: Thursday, April 19, 2007 4:47 PM
> > To: High-Availability Linux Development List
> > Subject: Re: [Linux-ha-dev] "start-delay" parameter for monitor operation
> > and "Split-Brain"
> >
> > > I am just trying to replicate a temporary blackout of
> > > the interconnect LAN.
> > > When some nodes resolve their Split-Brain,
> > > (1) If the LAN recovers before some operations are confirmed (it's
> > > Dummy01_monitor_1, this time),
> >
> > i dont see this happening anywhere in the logs you have supplied
>
> Well... it's on the 394th line of the file that I attached to this mail.


That's showing that the action _was_ confirmed.



> The CCM "instance" is going backwards from 3 to 2, so this is the same case
> that you have already pointed out as a bug.
> It has been filed in Bugzilla as #1546.
> I can reproduce this case with versions 2.0.8-1 and 2.0.9-1.
>
> Thanks,
> Junko Ikeda
