On Mon, Apr 30, 2012 at 12:23:56PM +1000, Andrew Beekhof wrote:
> On Sat, Apr 28, 2012 at 12:11 AM, Lars Ellenberg
> <lars.ellenb...@linbit.com> wrote:
> > On Thu, Apr 26, 2012 at 10:56:30AM +0900, renayama19661...@ybb.ne.jp wrote:
> >> Hi All,
> >>
> >> We ran a test that assumed a remote cluster environment,
> >> and we tested packet loss.
> >
> > You may be interested in this patch I have lying around for ages.
> >
> > It may be incomplete for one corner case:
> > On a seriously misconfigured and overloaded system,
> > I have seen reports for a single send_local_status()
> > (that is basically one single send_cluster_msg())
> > which took longer to execute than deadtime
> > (without even returning to the mainloop!).
> >
> > This corner case should be handled with a watchdog.
> > But without a watchdog, and without stonith,
> > the CCM was confused, because one node saw a
> > leave then re-join after partition event, while the other node did not
> > even notice it had left and rejoined the membership...
> > and pacemaker ended up being DC on both :-/
> 
> A side effect of the ccm being "really confused" I assume?

I guess so, yes.
Not sure if pacemaker could have handled it differently,
based on the input it was fed from ccm.

At some point, Pacemaker complained about "Another DC detected",
but things never really recovered.
If I can dig up the logs again, I'll show you some lines.

But then: no stonith, no watchdog, and a system overloaded to the point
where processing a single mainloop dispatch callback takes longer
than what is supposed to be the deadtime, and all of that within the
heartbeat communication main processes, which are supposed to be
realtime...

I don't think additional paranoia code would do much good,
on any level.

But thanks for your attention ;-)


> >
> > So I guess send_local_status() could do with an explicit call to
> > check_for_timeouts(), but that may need recursion protection.
> >
> >
> > I should really polish and push my queue some day soon...
> >
> > Cheers,
> >
> >
> > diff --git a/heartbeat/hb_rexmit.c b/heartbeat/hb_rexmit.c
...
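
For illustration only, the "recursion protection" mentioned above could
look roughly like the sketch below. It is not taken from the patch or
from the heartbeat sources: the *_sketch functions, the counters and the
simulate_dead_node flag are all hypothetical stand-ins, and it assumes a
single-threaded mainloop, so a plain static flag is enough as a guard.

```c
/* Hypothetical stand-ins; NOT the real heartbeat functions. */
static int in_timeout_check = 0;    /* recursion guard flag */
static int timeout_checks_run = 0;  /* how often the check body really ran */
static int status_sends = 0;        /* how often a status send was attempted */
static int simulate_dead_node = 0;  /* if set, the check triggers a send */

static void send_local_status_sketch(void);

static void check_for_timeouts_sketch(void)
{
    if (in_timeout_check) {
        /* Already running further up the call stack: do nothing. */
        return;
    }
    in_timeout_check = 1;
    timeout_checks_run++;

    /* Assumption: noticing a timed-out node may itself cause a status
     * broadcast, which would call back into this function. */
    if (simulate_dead_node) {
        send_local_status_sketch();
    }

    in_timeout_check = 0;
}

static void send_local_status_sketch(void)
{
    status_sends++;
    /* ... the (possibly very slow) send would happen here ... */

    /* Explicit check after the send, protected by the guard above. */
    check_for_timeouts_sketch();
}
```

With simulate_dead_node set, one call to send_local_status_sketch()
results in a second, nested send, but the nested timeout check returns
immediately instead of recursing without bound.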

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.