On 25 Jun 2014, at 12:03 am, Lars Ellenberg <lars.ellenb...@linbit.com> wrote:
> On Tue, Jun 24, 2014 at 12:23:30PM +1000, Andrew Beekhof wrote:
>>
>> On 24 Jun 2014, at 1:52 am, f...@vmware.com wrote:
>>
>>> Hi,
>>>
>>> I understand that initially the split-brain is caused by heartbeat
>>> messaging layer and there is nothing much can be done when packets are
>>> dropped. However, the problem is sometimes when the load is gone (or when
>>> iptables allows all traffic in my test setup), it doesn't recover.
>>>
>>> In the second case I provided, the heartbeat on both nodes did find each
>>> other and both were active, but pacemaker in both nodes still thinks peer
>>> is offline. I don't know if this is heartbeat's problem or Pacemaker's
>>> problem though.
>>
>> Do you see any messages from 'crmd' saying the node left/returned?
>> If you only see the node going away, then its almost certainly a heartbeat
>> problem.
>>
>> You may have better luck with a corosync based cluster, or even a newer
>> version of pacemaker (or both! the 1.0.x codebase is quite old at this
>> point).
>>
>> I was never all that happy with heartbeat's membership code, it was a
>> near-abandoned mystery box even at the point I started Pacemaker 10 years
>> ago.
>> Corosync membership had its problems in the beginning, but personally I take
>> comfort in the fact that its actively being worked on.
>> Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years
>> ago.
>
> Possibly. But especially with nodes
> "unexpectedly returning after having been declared dead",
> I've still seen more problems with corosync than with heartbeat,
> even within the last few years.

Unfortunately a fair share of those have also been pacemaker bugs :(
Yan is working on another one related to slow fencing devices.

>
> Anyways:
> Andrew is right, you should use (recent!) corosync and recent pacemaker.
> And working node level fencing aka stonith.
>
> That said, you said earlier you are using heartbeat 3.0.5,
> and that heartbeat successfully re-established membership.
> So you can confirm "ccm_testclient" on both nodes reports
> the expected and same membership?
>
> Is that 3.0.5 release tag, or a more "recent" hg checkout?
> You need heartbeat up to at least this commit:
> http://hg.linux-ha.org/heartbeat-STABLE_3_0/rev/fd1b907a0de6
>
> (I meant to add a 3.0.6 release tag since at least I pushed that commit,
> but because of packaging inconsistencies I want to fix,
> and other commitments, I deferred that much too long).
>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> _______________________________________________
> Linux-HA mailing list
> Linux-HA@lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems