Re: [Linux-ha-dev] "start-delay" parameter for monitor operation and "Split-Brain"
On 4/12/07, 池田淳子 <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I'm a newbie, and trying to understand how or when "Split-Brain" happens.

when 1 or more nodes can't communicate with each other

there is no connection between this and an operation's "start-delay"

what you *might* be seeing is an old bug that was triggered when start-delay > timeout

> As a trial, I am running a "Dummy" resource for now.
> Heartbeat version is 2.0.8;
> cib.xml, ha.cf and ha-log are attached.
> See the cases below, and please give some advice.
>
> *** case 01 ***
> My cib.xml was created using hb_gui, so start-delay was "1m".
> I ran Heartbeat on 2 nodes at first, and disconnected the interconnect LAN
> on the DC node to cause "Split-Brain". (# ifdown eth2)
> After causing "Split-Brain", I brought the LAN back up with "ifup eth2".
> If eth2 is restored after "Action Dummy01_monitor_1 (x) confirmed" on
> the former stand-by node, everything seems to work well.
> But if it is restored before the confirmation, something goes wrong.
> In the wrong case, I couldn't generate "Split-Brain" again.
> I found that pengine/tengine ran on both nodes, but one node kept trying to
> be DC and the Dummy resource wouldn't start on that node; additionally,
> failcount was incremented on that strange node.
>
> The log message is here...
> WARN: do_dc_join_finalize: join-3: We are still in a transition. Delaying
> until the TE completes.
>
> In this case, I couldn't shut down the Heartbeat process without a KILL command...
>
> *** case 02 ***
> I changed start-delay from 1m to 0s.
> The confirmation process for monitor then works immediately, so even though
> I ran ifdown/ifup in a row, it didn't matter.
>
> Q1; My guess: if the interconnect LAN goes down and up in a row while some
> operations aren't confirmed yet, one node would consider this situation an
> ERROR and update its failcount. This case might appear when "start-delay"
> is long (e.g. "1m"), and cause some strange "Split-Brain".
> Is this relationship between "start-delay" and "Split-Brain" correct?
>
> *** case 03 ***
> In case 02, the interconnect LAN comes up after "Split-Brain" is completed,
> which means each node has been elected as DC.
> For the third test, I tried to bring the interconnect LAN down/up before the
> DC election finished.
> As a result, I got "Split-Brain" again, but one node stayed OFFLINE when
> the interconnect LAN was up.
>
> Q2; I know I had better set up two or more interconnect LANs just in
> case, but are there any preferred ways to avoid case 03? E.g., tuning some
> parameters or something like that.
>
> Best Regards,
> Junko Ikeda
>
> NTT DATA INTELLILINK CORPORATION
> Open Source Solutions Business Unit
> Open Source Business Division
>
> Toyosu Center Building Annex, 3-3-9, Toyosu,
> Koto-ku, Tokyo 135-0061, Japan
> TEL : +81-3-3534-4811
> FAX : +81-3-3534-4814
> mailto:[EMAIL PROTECTED]
> http://www.intellilink.co.jp/

___
Linux-HA-Dev: [EMAIL PROTECTED]
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/
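The reply above mentions an old bug that was triggered when start-delay > timeout. A minimal sketch of how one might check a CIB for that condition follows. The XML fragment is a hypothetical example (not the poster's attached cib.xml), the attribute spelling (`start-delay` versus `start_delay`) is an assumption and both are accepted, and treating a bare number as seconds is likewise an assumption.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical CIB fragment for illustration only; not the attached cib.xml.
SAMPLE_CIB = """
<resources>
  <primitive id="Dummy01" class="ocf" provider="heartbeat" type="Dummy">
    <operations>
      <op id="Dummy01_monitor" name="monitor" interval="10s" timeout="30s" start-delay="1m"/>
      <op id="Dummy01_start" name="start" timeout="60s" start-delay="0s"/>
    </operations>
  </primitive>
</resources>
"""

# Multipliers that convert each supported unit suffix to seconds.
_UNITS = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}


def to_seconds(value):
    """Convert strings such as '1m', '30s', '500ms' or '20' to seconds.

    A value without a unit is assumed to be in seconds.
    """
    match = re.fullmatch(r"\s*(\d+)\s*(ms|s|m|h)?\s*", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit or "s"]


def ops_with_delay_over_timeout(cib_xml):
    """Return the ids of <op> elements whose start delay exceeds their timeout."""
    flagged = []
    for op in ET.fromstring(cib_xml).iter("op"):
        # Accept either attribute spelling; which one applies is an assumption.
        delay = op.get("start-delay") or op.get("start_delay")
        timeout = op.get("timeout")
        if delay is None or timeout is None:
            continue
        if to_seconds(delay) > to_seconds(timeout):
            flagged.append(op.get("id"))
    return flagged


if __name__ == "__main__":
    print(ops_with_delay_over_timeout(SAMPLE_CIB))
```

With the sample fragment, only the monitor operation is flagged, because its delay of "1m" is longer than its timeout of "30s".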
Re: [Linux-ha-dev] "start-delay" parameter for monitor operation and "Split-Brain"
池田淳子 wrote:
> Hi all,
>
> I'm a newbie, and trying to understand how or when "Split-Brain" happens.
> As a trial, I am running a "Dummy" resource for now.

Split-brain occurs when there is a total communication failure (from the
heartbeat perspective) between at least two different cluster nodes.

> Heartbeat version is 2.0.8;
> cib.xml, ha.cf and ha-log are attached.
> See the cases below, and please give some advice.
>
> *** case 01 ***
> My cib.xml was created using hb_gui, so start-delay was "1m".
> I ran Heartbeat on 2 nodes at first, and disconnected the interconnect LAN
> on the DC node to cause "Split-Brain". (# ifdown eth2)
> After causing "Split-Brain", I brought the LAN back up with "ifup eth2".
> If eth2 is restored after "Action Dummy01_monitor_1 (x) confirmed" on
> the former stand-by node, everything seems to work well.
> But if it is restored before the confirmation, something goes wrong.
> In the wrong case, I couldn't generate "Split-Brain" again.
> I found that pengine/tengine ran on both nodes, but one node kept trying to
> be DC and the Dummy resource wouldn't start on that node; additionally,
> failcount was incremented on that strange node.
>
> The log message is here...
> WARN: do_dc_join_finalize: join-3: We are still in a transition. Delaying
> until the TE completes.
>
> In this case, I couldn't shut down the Heartbeat process without a KILL command...

There are some known bugs regarding not being able to shut down. They're
fixed in the upcoming version.

> *** case 02 ***
> I changed start-delay from 1m to 0s.
> The confirmation process for monitor then works immediately, so even though
> I ran ifdown/ifup in a row, it didn't matter.
>
> Q1; My guess: if the interconnect LAN goes down and up in a row while some
> operations aren't confirmed yet, one node would consider this situation an
> ERROR and update its failcount. This case might appear when "start-delay"
> is long (e.g. "1m"), and cause some strange "Split-Brain".
> Is this relationship between "start-delay" and "Split-Brain" correct?

As Andrew pointed out, there is no connection.

> *** case 03 ***
> In case 02, the interconnect LAN comes up after "Split-Brain" is completed,
> which means each node has been elected as DC.
> For the third test, I tried to bring the interconnect LAN down/up before the
> DC election finished.
> As a result, I got "Split-Brain" again, but one node stayed OFFLINE when
> the interconnect LAN was up.
>
> Q2; I know I had better set up two or more interconnect LANs just in
> case, but are there any preferred ways to avoid case 03? E.g., tuning some
> parameters or something like that.

The nodes should go online again after rejoining. What version is this
with? Can you supply logs for what happened in this case?

By the way, doing ifup/ifdown of interfaces you're using for heartbeat
connections has been known to make heartbeat sick. Plugging and
unplugging the connectors, or adding/removing firewall rules does not do
this.

--
Alan Robertson <[EMAIL PROTECTED]>

"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce
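As a rough illustration of the firewall-rule approach suggested above, the sketch below only builds iptables command strings and does not run them. It assumes heartbeat traffic travels over UDP on port 694 (the usual `udpport` default) and that the interconnect is `eth2`, as in the original post; the function names are hypothetical, and both values should be checked against the actual ha.cf.

```python
import shlex


def heartbeat_block_rules(interface="eth2", udp_port=694):
    """Build iptables rule arguments that drop heartbeat UDP traffic on one interface."""
    match = ["-p", "udp", "--dport", str(udp_port), "-j", "DROP"]
    return [
        ["INPUT", "-i", interface] + match,   # packets arriving from the peer
        ["OUTPUT", "-o", interface] + match,  # packets sent to the peer
    ]


def iptables_commands(action, interface="eth2", udp_port=694):
    """Return command strings; action is '-A' to add the rules or '-D' to delete them."""
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    return [
        shlex.join(["iptables", action] + rule)
        for rule in heartbeat_block_rules(interface, udp_port)
    ]


if __name__ == "__main__":
    # Print the commands that would simulate the outage, then the ones that end it.
    for command in iptables_commands("-A") + iptables_commands("-D"):
        print(command)
```

Running the "-A" commands as root would cut heartbeat communication without touching the interface state, and the matching "-D" commands would restore it.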
Re: [Linux-ha-dev] "start-delay" parameter for monitor operation and "Split-Brain"
On 4/13/07, 池田淳子 <[EMAIL PROTECTED]> wrote:
> Hi Andrew, Alan
>
> Thank you for your comments.
> I tried to run the Dummy resource using these parameters this time.
> Heartbeat version is 2.0.8.
>
> case01;
> Heartbeat ran well on 2 nodes, and I brought down the interconnect LAN from
> the network switch. (shut down the port)
> Split-Brain occurred. The stand-by node became a DC.
> I brought up the LAN immediately.
> If "Action Dummy01_monitor_1" wasn't confirmed on the former stand-by node
> when the LAN came up, that node would behave strangely at the next
> Split-Brain.

I'm lost... what are you talking about here?
I also don't see anything like this in the logs

> I can see the following message after I brought down the LAN again...
> WARN: do_dc_join_finalize: join-2: We are still in a transition. Delaying
> until the TE completes.
> It seems that one node tries to join something, but it keeps failing.
> Is this the correct behavior of Heartbeat2?

it looks like a bug

any chance you could add "debug 1" to ha.cf and reproduce it?
ideally with the latest development version which should be released soonish

> case02;
> I did down/up the interconnect LAN again.
> In this case, the LAN trouble was recovered before the DC election on the
> stand-by node.
> After recovering, the former stand-by node keeps its status as OFFLINE.

well they didn't recover then did they :-)

this is without doubt a CCM bug

if one greps for the following patterns:
# grep -e cib.*ccm -e cib.*mem_handle_event split-brain2/case01/dl380g5a-ha-log

then you can clearly see the CCM "instance" going backwards from 3 to 2.

cib[17889]: 2007/04/13_11:24:49 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[17889]: 2007/04/13_11:24:49 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
cib[17889]: 2007/04/13_11:24:49 info: mem_handle_event: instance=3, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3
cib[17889]: 2007/04/13_11:24:49 info: cib_ccm_msg_callback: LOST: dl380g5b
cib[17889]: 2007/04/13_11:24:49 info: cib_ccm_msg_callback: PEER: dl380g5a
cib[17889]: 2007/04/13_11:26:08 info: mem_handle_event: Got an event OC_EV_MS_INVALID from ccm
cib[17889]: 2007/04/13_11:26:08 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm
cib[17889]: 2007/04/13_11:26:08 info: mem_handle_event: instance=2, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4
cib[17889]: 2007/04/13_11:26:08 info: cib_ccm_msg_callback: PEER: dl380g5b
cib[17889]: 2007/04/13_11:26:08 info: cib_ccm_msg_callback: PEER: dl380g5a

likewise in split-brain2/case02/dl380g5b-ha-log you can see the
progression 1->2->3->1->2

> Alan, see the attached logs for details.
>
> Best Regards,
> Junko Ikeda
>
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of 池田淳子
> Sent: Thursday, April 12, 2007 8:16 PM
> To: linux-ha-dev@lists.linux-ha.org
> Subject: [Linux-ha-dev] "start-delay" parameter for monitor operation
> and "Split-Brain"
>
> Hi all,
>
> I'm a newbie, and trying to understand how or when "Split-Brain" happens.
> As a trial, I am running a "Dummy" resource for now.
> Heartbeat version is 2.0.8;
> cib.xml, ha.cf and ha-log are attached.
> See the cases below, and please give some advice.
>
> *** case 01 ***
> My cib.xml was created using hb_gui, so start-delay was "1m".
> I ran Heartbeat on 2 nodes at first, and disconnected the interconnect LAN
> on the DC node to cause "Split-Brain". (# ifdown eth2)
> After causing "Split-Brain", I brought the LAN back up with "ifup eth2".
> If eth2 is restored after "Action Dummy01_monitor_1 (x) confirmed" on
> the former stand-by node, everything seems to work well.
> But if it is restored before the confirmation, something goes wrong.
> In the wrong case, I couldn't generate "Split-Brain" again.
> I found that pengine/tengine ran on both nodes, but one node kept trying to
> be DC and the Dummy resource wouldn't start on that node; additionally,
> failcount was incremented on that strange node.
>
> The log message is here...
> WARN: do_dc_join_finalize: join-3: We are still in a transition. Delaying
> until the TE completes.
>
> In this case, I couldn't shut down the Heartbeat process without a KILL command...
>
> *** case 02 ***
> I changed start-delay from 1m to 0s.
> The confirmation process for monitor then works immediately, so even though
> I ran ifdown/ifup in a row, it didn't matter.
>
> Q1; My guess: if the interconnect LAN goes down and up in a row while some
> operations aren't confirmed yet, one node would consider this situation an
> ERROR and update its failcount. This case might appear when "start-delay"
> is long (e.g. "1m"), and cause some strange "Split-Brain".
> Is this relationship between "start-delay" and "Split-Brain" correct?
>
> *** case 03 ***
> In case 02, the interconnect LAN comes up after "Split-Brain" is completed,
> which means each node has been elected as DC.
> For the third test, I tried to bring the interconnect LAN down/up before the
> DC election finished.
> As a result, I got "Split-Brain" again, but one node stayed OFFLINE when
> the interconnect LAN was up.
>
> Q2; I know I had better set up two or more interconnect LANs just in
> case, but are there any preferred ways to avoid case 03? E.g., tuning some
> parameters or something like that.
>
> Best Regards,
> Junko Ikeda
>
> NTT DATA INTELLILINK CORPORATION
> Open Source Solutions Business Unit
> Open Source Business Division
>
> Toyosu Center Building Annex, 3-3-9, Toyosu,
> Koto-ku, Tokyo 135-0061, Japan
> TEL : +81-3-3534-4811
> FAX : +81-3-3534-4814
> mailto:[EMAIL PROTECTED]
> http://www.intellilink.co.jp/
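The check described in the message above (grepping the ha-log and watching the CCM "instance" number) can be sketched as a small script. This is only an illustration: it assumes the `mem_handle_event: instance=N, nodes=M` line format shown in the quoted excerpt, and it takes the expectation that the instance number should not decrease from the remark in that message rather than from CCM documentation. The function name is hypothetical.

```python
import re

# Matches the membership summary lines shown in the quoted log excerpt.
INSTANCE_RE = re.compile(r"mem_handle_event: instance=(\d+), nodes=(\d+)")


def instance_regressions(log_lines):
    """Return (previous, current) pairs where the CCM instance number decreased."""
    regressions = []
    previous = None
    for line in log_lines:
        match = INSTANCE_RE.search(line)
        if match is None:
            continue  # not a membership summary line
        current = int(match.group(1))
        if previous is not None and current < previous:
            regressions.append((previous, current))
        previous = current
    return regressions


# Three lines copied from the log excerpt quoted in the thread.
SAMPLE = [
    "cib[17889]: 2007/04/13_11:24:49 info: mem_handle_event: instance=3, nodes=1, new=0, lost=1, n_idx=0, new_idx=1, old_idx=3",
    "cib[17889]: 2007/04/13_11:26:08 info: mem_handle_event: Got an event OC_EV_MS_NEW_MEMBERSHIP from ccm",
    "cib[17889]: 2007/04/13_11:26:08 info: mem_handle_event: instance=2, nodes=2, new=1, lost=0, n_idx=0, new_idx=2, old_idx=4",
]

if __name__ == "__main__":
    print(instance_regressions(SAMPLE))  # [(3, 2)]
```

To scan a whole log, one could pass an open file object instead of the sample list, since the function only iterates over lines.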
Re: [Linux-ha-dev] "start-delay" parameter for monitor operation and "Split-Brain"
On 4/19/07, Junko IKEDA <[EMAIL PROTECTED]> wrote:
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of Andrew Beekhof
> > Sent: Thursday, April 19, 2007 4:47 PM
> > To: High-Availability Linux Development List
> > Subject: Re: [Linux-ha-dev] "start-delay" parameter for monitor operation
> > and "Split-Brain"
> >
> > > I am just trying to replicate the circumstances of a temporary blackout of
> > > the interconnect LAN.
> > > When some nodes resolve their Split-Brain,
> > > (1) If the LAN recovers before some operations are confirmed (it's
> > > Dummy01_monitor_1, this time),
> >
> > i don't see this happening anywhere in the logs you have supplied
>
> well... it's in the 394th line that I attached to this mail.

that's showing that the action _was_ confirmed

> the CCM "instance" goes backwards from 3 to 2, so this is the same case that
> you have already pointed out as a bug.
> It was posted to bugzilla as #1546.
> I can see this case with versions 2.0.8-1 and 2.0.9-1.
>
> Thanks,
> Junko Ikeda