On 2008-04-04T17:34:04, Junko IKEDA <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am running a split-brain test like this:
>
> (1) start Heartbeat (node-a/node-b)
> (2) run Dummy resource on node-a
> (3) down the interconnect LAN -> a split brain
> (4) stop Heartbeat (only node-b)
>
> This part is a little tricky: I modified the Dummy RA on node-b so that it
> would "sleep 10" when monitor_0 (= probe) was called.
>
> (5) start Heartbeat (only node-b. Heartbeat keeps running on node-a, Dummy
> is running on node-a)
> (6) The split brain is still ongoing, so Dummy would start on node-b, even
> though it's already running on node-a.
> (7) Dummy on node-a would sleep 10 seconds before starting... I restored the
> interconnect LAN at that exact moment.
>
> There are two results:
> (case-1) ... hb_report_1
> Heartbeat can recover from the split brain successfully.
> Dummy ends up on one side.
>
> (case-2) ... hb_report_2
> Heartbeat cannot recover from the split brain.
> Neither node lets the other one be added to the membership.
> This is a rare case and hard to reproduce, but possible.
>
> See attached hb_report_2/node-b/ha-log.
> Heartbeat noticed that the interconnect LAN came up during the split brain.
>
> heartbeat[9216]: 2008/04/04_16:27:22 info: Link node-b:eth2 up.
> heartbeat[9216]: 2008/04/04_16:27:23 info: Link node-a:eth2 up.
>
> but it didn't consider its partner as a member...
> instance ID is wrong again.
> crmd[9229]: 2008/04/04_16:27:23 info: ccm_event_detail: NEW MEMBERSHIP: trans=2, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3

That's a ccm bug. I guess it can be prevented by using STONITH ;-)

Regards,
    Lars

-- 
Teamlead Kernel, SuSE Labs, Research and Development
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
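
The symptom in the quoted ha-log (both links reported up, yet crmd still announces a one-node membership) can be checked for mechanically when going through many hb_report runs. What follows is only a rough Python sketch, not part of Heartbeat. It assumes the two log line formats shown in the excerpt above; the function name `find_stuck_membership` and the `expected_nodes` parameter are made up for illustration.

```python
import re

# Patterns taken from the two kinds of log lines quoted in this thread.
LINK_UP = re.compile(r"info: Link (?P<node>\S+):(?P<iface>\S+) up\.")
MEMBERSHIP = re.compile(
    r"ccm_event_detail: NEW MEMBERSHIP:\s*trans=(?P<trans>\d+), nodes=(?P<nodes>\d+)"
)


def find_stuck_membership(lines, expected_nodes=2):
    """Return membership events that still report fewer than
    expected_nodes members after links for expected_nodes distinct
    nodes have been logged as up (hypothetical helper)."""
    linked = set()   # node names whose link has been reported up
    stuck = []
    for line in lines:
        m = LINK_UP.search(line)
        if m:
            linked.add(m.group("node"))
            continue
        m = MEMBERSHIP.search(line)
        if (
            m
            and len(linked) >= expected_nodes
            and int(m.group("nodes")) < expected_nodes
        ):
            stuck.append(
                {"trans": int(m.group("trans")), "nodes": int(m.group("nodes"))}
            )
    return stuck


# The three lines quoted from hb_report_2/node-b/ha-log.
SAMPLE = [
    "heartbeat[9216]: 2008/04/04_16:27:22 info: Link node-b:eth2 up.",
    "heartbeat[9216]: 2008/04/04_16:27:23 info: Link node-a:eth2 up.",
    "crmd[9229]: 2008/04/04_16:27:23 info: ccm_event_detail: NEW MEMBERSHIP: "
    "trans=2, nodes=1, new=0, lost=0 n_idx=0, new_idx=1, old_idx=3",
]

if __name__ == "__main__":
    for event in find_stuck_membership(SAMPLE):
        print("membership still partial after link up:", event)
```

A hit only means the log matches the pattern reported here; it does not by itself prove the ccm bug, since a membership event logged shortly after link-up may simply precede a normal rejoin.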