Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

fank Tue, 24 Jun 2014 08:22:52 -0700

Hi Andrew,

On node-1, the last status update from crmd is the following, but crm_mon -1 
still shows node-0 as offline:
crmd_ha_status_callback: Status update: Node node-0 now has status [active] [DC=false]
The same happens on node-0: crmd reports that node-1 now has status active, but 
crm_mon -1 shows it as offline.


Thanks,
-Kaiwei

----- Original Message -----
From: "Andrew Beekhof" <and...@beekhof.net>
To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>
Sent: Monday, June 23, 2014 7:23:30 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster


On 24 Jun 2014, at 1:52 am, f...@vmware.com wrote:

> Hi,
> 
> I understand that the split-brain is initially caused by the heartbeat 
> messaging layer and that not much can be done when packets are dropped. 
> However, the problem is that sometimes, when the load is gone (or when 
> iptables allows all traffic in my test setup), the cluster doesn't recover.
> 
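For anyone who wants to reproduce the iptables test described in this thread (drop heartbeat packets to UDP port 694 for 5 to 8 seconds, then allow them for 1 to 2 seconds), here is a rough sketch. It is not the script actually used: the function names `drop_rule` and `flap` are made up for illustration, and it assumes the rule is applied on the INPUT chain of a single node so that the loss is one-way.

```python
import random
import subprocess
import time

HEARTBEAT_PORT = 694  # UDP destination port mentioned in the thread


def drop_rule(action, port=HEARTBEAT_PORT):
    """Build the iptables command that inserts ("-I") or deletes ("-D")
    a rule dropping incoming UDP packets addressed to the heartbeat port."""
    if action not in ("-I", "-D"):
        raise ValueError("action must be -I or -D")
    return ["iptables", action, "INPUT", "-p", "udp",
            "--dport", str(port), "-j", "DROP"]


def flap(cycles, run=subprocess.check_call, sleep=time.sleep):
    """Alternate between dropping (5-8 s) and allowing (1-2 s) heartbeat traffic.

    `run` and `sleep` are injectable so the loop can be dry-run without root.
    """
    for _ in range(cycles):
        run(drop_rule("-I"))
        try:
            sleep(random.uniform(5, 8))
        finally:
            run(drop_rule("-D"))  # always remove the rule again
        sleep(random.uniform(1, 2))
```

Calling `flap(20)` as root on one node would run twenty drop/allow cycles; passing `run=print` gives a dry run that only prints the commands.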
> In the second case I provided, heartbeat on both nodes did find the peer 
> and both nodes were active, but Pacemaker on each node still thinks its peer 
> is offline. I don't know whether this is heartbeat's problem or Pacemaker's.

Do you see any messages from 'crmd' saying the node left/returned?
If you only see the node going away, then it's almost certainly a heartbeat 
problem.
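
One way to answer that question from a saved log is sketched below. It only assumes the crmd line format quoted in this thread ("Status update: Node ... now has status [...]"); the helper names are hypothetical, and the `[dead]` lines in the usage example are made up to match the "dead"/"active" flapping described in the original report.

```python
import re

# Pattern modeled on the crmd line quoted in this thread; other log
# formats would need a different expression.
STATUS_RE = re.compile(
    r"crmd_ha_status_callback: Status update: "
    r"Node (\S+) now has status \[(\w+)\]"
)


def status_history(log_lines):
    """Return {node: [status, ...]} in the order crmd reported them."""
    history = {}
    for line in log_lines:
        match = STATUS_RE.search(line)
        if match:
            history.setdefault(match.group(1), []).append(match.group(2))
    return history


def left_without_return(history):
    """Nodes whose most recent crmd status is something other than 'active'."""
    return sorted(node for node, seen in history.items()
                  if seen and seen[-1] != "active")


log = [
    "crmd_ha_status_callback: Status update: Node node-0 now has status [dead] [DC=false]",
    "crmd_ha_status_callback: Status update: Node node-0 now has status [active] [DC=false]",
]
print(status_history(log))
print(left_without_return(status_history(log)))
```

If a node shows up in `left_without_return`, crmd only ever saw it go away, which points at the heartbeat layer. If the list is empty but crm_mon still shows the node offline, crmd did see the return.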

You may have better luck with a corosync based cluster, or even a newer version 
of pacemaker (or both! the 1.0.x codebase is quite old at this point).

I was never all that happy with heartbeat's membership code, it was a 
near-abandoned mystery box even at the point I started Pacemaker 10 years ago.
Corosync membership had its problems in the beginning, but personally I take 
comfort in the fact that it's actively being worked on.
Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years ago.

> 
> Thanks,
> -Kaiwei
> 
> ----- Original Message -----
> From: "Andrew Beekhof" <and...@beekhof.net>
> To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>
> Sent: Sunday, June 22, 2014 3:45:00 PM
> Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster
> 
> 
> On 21 Jun 2014, at 5:18 am, f...@vmware.com wrote:
> 
>> Hi,
>> 
>> New to this list and hope I can get some help here.
>> 
>> I'm using pacemaker 1.0.10 and heartbeat 3.0.5 for a two-node cluster. I'm 
>> having a split-brain problem when heartbeat messages sometimes get dropped 
>> while the system is under high load. However, the problem is that it never 
>> recovers once the system load drops again.
>> 
>> I created a test setup to reproduce this by setting the dead time to 6 
>> seconds and using iptables to continuously drop one-way heartbeat packets 
>> (udp dst port 694) for 5~8 seconds and then resume the traffic for 1~2 
>> seconds. After the system got into a split-brain state, I stopped the test 
>> and allowed all heartbeat traffic to go through. Sometimes the system 
>> recovered, but sometimes it didn't. There are various symptoms when the 
>> system didn't recover from split-brain:
>> 
>> 1. In one instance, cl_status listnodes becomes empty. The syslog keeps 
>> showing
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.warning] [2853]: WARN: Message hist queue is filling up (436 messages in queue)
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: hist->ackseq =12111
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: hist->lowseq =12111, hist->hiseq=12547
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: expecting from node-1
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat:  [daemon.debug] [2853]: debug: it's ackseq=12111
>> 
>> 2. In another instance, cl_status nodestatus <node> shows that both nodes 
>> are active, but "crm_mon -1" shows that each of the two nodes thinks it is 
>> the DC and that the peer node is offline. The pengine process is running on 
>> one node only. The node not running pengine (but still thinking it is the 
>> DC) has a log showing that crmd terminated pengine because it detected that 
>> the peer was active. After that, the peer status kept flapping between dead 
>> and active, but pengine was never started again. The last log entry shows 
>> the peer as active (after I stopped the test and allowed all traffic). 
>> However, "crm_mon -1" shows the local node as the DC and the peer as offline:
>> 
>> [root@node-1 ~]# crm_mon -1
>> ============
>> Last updated: Fri Jun 20 19:12:23 2014
>> Stack: Heartbeat
>> Current DC: node-1 (bf053fc5-5afd-b483-2ad2-3c9fc354f7fa) - partition with quorum
>> Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3
>> 2 Nodes configured, unknown expected votes
>> 1 Resources configured.
>> ============
>> 
>> Online: [ node-1 ]
>> OFFLINE: [ node-0 ]
>> 
>> cluster     (heartbeat:ha):      Started node-1
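
The mismatch in the second symptom (heartbeat says "active", Pacemaker says OFFLINE) can be checked mechanically. The sketch below parses the `Online:` / `OFFLINE:` lines in the format shown in the output above and compares them with per-node statuses collected by hand from `cl_status nodestatus <node>`. The function names and the shape of the `heartbeat_status` dict are assumptions for illustration only.

```python
import re


def parse_crm_mon(text):
    """Extract the Online and OFFLINE node lists from `crm_mon -1` output."""
    nodes = {"online": set(), "offline": set()}
    for line in text.splitlines():
        match = re.match(r"\s*(Online|OFFLINE): \[(.*)\]", line)
        if match:
            nodes[match.group(1).lower()].update(match.group(2).split())
    return nodes


def disagreements(crm_mon_text, heartbeat_status):
    """Nodes that heartbeat reports as active but Pacemaker lists as OFFLINE.

    heartbeat_status maps node name -> status string (for example "active"),
    as read manually from `cl_status nodestatus <node>`.
    """
    offline = parse_crm_mon(crm_mon_text)["offline"]
    return sorted(node for node, status in heartbeat_status.items()
                  if status == "active" and node in offline)


sample = """Online: [ node-1 ]
OFFLINE: [ node-0 ]
"""
print(disagreements(sample, {"node-0": "active", "node-1": "active"}))
```

A non-empty result means the two layers disagree about membership, which is the state described in this report.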
>> 
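
For the first symptom, the debug lines are worth a quick arithmetic check. The sketch below pulls `lowseq` and `hiseq` out of the heartbeat debug line quoted above and computes their difference. Reading that difference as the queue length is my interpretation rather than something confirmed from the heartbeat source, but for the quoted log it gives 12547 - 12111 = 436, the same number as in the "Message hist queue is filling up" warning.

```python
import re


def hist_backlog(log_text):
    """Return (lowseq, hiseq, hiseq - lowseq) from heartbeat's hist debug line."""
    match = re.search(
        r"hist->lowseq\s*=\s*(\d+),\s*hist->hiseq\s*=\s*(\d+)", log_text
    )
    if match is None:
        raise ValueError("no hist->lowseq/hiseq line found")
    low, high = int(match.group(1)), int(match.group(2))
    return low, high, high - low


line = "debug: hist->lowseq =12111, hist->hiseq=12547"
print(hist_backlog(line))
```

In the same log, ackseq equals lowseq (12111), which would be consistent with node-1 not acknowledging anything newer, though that too is an inference from the log text alone.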
>> 
>> Any help is appreciated, such as a pointer to the source code where the 
>> problem might be, or to an existing bug filed for this (I searched but 
>> didn't find matching symptoms).
> 
> This is happening at the heartbeat level.
> 
> Not much pacemaker can do I'm afraid.  Perhaps look to see if heartbeat is 
> "real time" scheduled; if not, that may explain why it's being starved of CPU 
> and can't get its messages out.
> 
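
A minimal way to check the scheduling suggestion on Linux, assuming Python 3 is available on the node: `os.sched_getscheduler` returns the scheduling policy of a process, and SCHED_FIFO and SCHED_RR are the real-time policies. The helper name `is_realtime` is made up for this sketch, and which heartbeat PID to inspect is left to the reader.

```python
import os

# Policies the Linux kernel treats as real-time.
REALTIME_POLICIES = {os.SCHED_FIFO, os.SCHED_RR}


def is_realtime(pid):
    """True if the process runs under SCHED_FIFO or SCHED_RR (Linux only)."""
    return os.sched_getscheduler(pid) in REALTIME_POLICIES


# Example: inspect the current process; replace with a heartbeat PID.
print(is_realtime(os.getpid()))
```

The `chrt -p <pid>` command from util-linux reports the same information from the shell.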
>> 
>> Thanks,
>> -Kaiwei
>> _______________________________________________
>> Linux-HA mailing list
>> Linux-HA@lists.linux-ha.org
>> https://urldefense.proofpoint.com/v1/url?u=http://lists.linux-ha.org/mailman/listinfo/linux-ha&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=9IPI1z37RqWr21klX9jnPw%3D%3D%0A&m=wkHclXpXSRj4jN%2FF6yECfGjmY217uMGOFlJmf%2B1H4FA%3D%0A&s=1816bc839d2eb1e28a3d00afaecf7d0ad1eb371fc314b0acf875b0c3e6c9add8
>> See also: 
>> https://urldefense.proofpoint.com/v1/url?u=http://linux-ha.org/ReportingProblems&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=9IPI1z37RqWr21klX9jnPw%3D%3D%0A&m=wkHclXpXSRj4jN%2FF6yECfGjmY217uMGOFlJmf%2B1H4FA%3D%0A&s=03ff7c6b602a98e907a05bd086150d61cfaefe6e06fe60ac881fee79077a76f6
> 
> 


_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
