Hi Andrew,

On node-1, I do see the last status update from crmd as follows, but crm_mon -1 still shows node-0 as offline:

crmd_ha_status_callback: Status update: Node node-0 now has status [active] [DC=false]

It is the same on node-0: the last update shows node-1 now has status active, but crm_mon -1 shows it as offline.

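The mismatch described above can be checked mechanically. What follows is only a minimal sketch, assuming the crmd log line and the `crm_mon -1` membership lines look exactly like the samples quoted in this thread; the function names are hypothetical and are not part of Pacemaker or heartbeat.

```python
import re

# Matches crmd lines such as:
#   crmd_ha_status_callback: Status update: Node node-0 now has status [active] [DC=false]
# (format assumed from the sample line in this thread)
STATUS_RE = re.compile(
    r"crmd_ha_status_callback: Status update: Node (\S+) now has status \[(\w+)\]"
)

# Matches the "Online: [ ... ]" and "OFFLINE: [ ... ]" lines of `crm_mon -1`
# (format assumed from the sample output in this thread)
MEMBERS_RE = re.compile(r"^\s*(Online|OFFLINE):\s*\[([^\]]*)\]", re.MULTILINE)


def last_crmd_status(log_text):
    """Return {node: last status reported by crmd} from syslog text."""
    status = {}
    for node, state in STATUS_RE.findall(log_text):
        status[node] = state  # later lines overwrite earlier ones
    return status


def crm_mon_view(crm_mon_text):
    """Return {node: 'online' or 'offline'} from `crm_mon -1` output."""
    view = {}
    for label, names in MEMBERS_RE.findall(crm_mon_text):
        for node in names.split():
            view[node] = label.lower()
    return view


def disagreements(log_text, crm_mon_text):
    """Nodes that crmd last reported as active but crm_mon lists as offline."""
    view = crm_mon_view(crm_mon_text)
    return sorted(
        node
        for node, state in last_crmd_status(log_text).items()
        if state == "active" and view.get(node) == "offline"
    )
```

Feeding it the syslog text and the `crm_mon -1` output captured on the same node would list the peers for which the two views disagree, which is the symptom reported here.
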
Thanks,
-Kaiwei

----- Original Message -----
From: "Andrew Beekhof" <and...@beekhof.net>
To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>
Sent: Monday, June 23, 2014 7:23:30 PM
Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node cluster

On 24 Jun 2014, at 1:52 am, f...@vmware.com wrote:

> Hi,
>
> I understand that initially the split-brain is caused by the heartbeat messaging
> layer and there is not much that can be done when packets are dropped.
> However, the problem is that sometimes, when the load is gone (or when iptables
> allows all traffic in my test setup), it doesn't recover.
>
> In the second case I provided, the heartbeat on both nodes did find each
> other and both were active, but pacemaker on both nodes still thinks the peer is
> offline. I don't know if this is heartbeat's problem or Pacemaker's problem
> though.

Do you see any messages from 'crmd' saying the node left/returned?
If you only see the node going away, then it's almost certainly a heartbeat problem.

You may have better luck with a corosync-based cluster, or even a newer version of pacemaker (or both! the 1.0.x codebase is quite old at this point).

I was never all that happy with heartbeat's membership code; it was a near-abandoned mystery box even at the point I started Pacemaker 10 years ago.
Corosync membership had its problems in the beginning, but personally I take comfort in the fact that it's actively being worked on.
Opinions differ, but IMHO it surpassed heartbeat for reliability 3-4 years ago.

>
> Thanks,
> -Kaiwei
>
> ----- Original Message -----
> From: "Andrew Beekhof" <and...@beekhof.net>
> To: "General Linux-HA mailing list" <linux-ha@lists.linux-ha.org>
> Sent: Sunday, June 22, 2014 3:45:00 PM
> Subject: Re: [Linux-HA] unable to recover from split-brain in a two-node
> cluster
>
>
> On 21 Jun 2014, at 5:18 am, f...@vmware.com wrote:
>
>> Hi,
>>
>> New to this list and hope I can get some help here.
>>
>> I'm using pacemaker 1.0.10 and heartbeat 3.0.5 for a two-node cluster. I'm
>> having a split-brain problem when heartbeat messages sometimes get dropped
>> while the system is under high load. However, the problem is that it never
>> recovers when the system load becomes low again.
>>
>> I created a test setup to test this by setting the dead time to 6 seconds, and
>> continuously dropping one-way heartbeat packets (udp dst port 694) for 5~8
>> seconds and resuming the traffic for 1~2 seconds using iptables. After the
>> system got into the split-brain state, I stopped the test and allowed all
>> heartbeat traffic to go through. Sometimes the system recovered but sometimes
>> it didn't. There are various symptoms when the system didn't recover from
>> split-brain:
>>
>> 1. In one instance, cl_status listnodes becomes empty. The syslog keeps
>> showing
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.warning] [2853]: WARN: Message hist queue is filling up (436 messages in queue)
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: hist->ackseq =12111
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: hist->lowseq =12111, hist->hiseq=12547
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: expecting from node-1
>> 2014-06-19T18:59:57+00:00 node-0 heartbeat: [daemon.debug] [2853]: debug: it's ackseq=12111
>>
>> 2. In another instance, cl_status nodestatus <node> shows both nodes are
>> active, but "crm_mon -1" shows that each of the two nodes thinks it is
>> the DC and the peer node is offline. The pengine process is running on one
>> node only. The node not running pengine (but still thinking it is the DC) has
>> a log showing that crmd terminated pengine because it detected that the peer
>> is active. After that, the peer status keeps flapping between dead and active,
>> but pengine has never been started again. The last log entry shows the peer is
>> active (after I stopped the test and allowed all traffic).
>> However, "crm_mon -1" shows that the node itself is the DC and the peer is
>> offline:
>>
>> [root@node-1 ~]# crm_mon -1
>> ============
>> Last updated: Fri Jun 20 19:12:23 2014
>> Stack: Heartbeat
>> Current DC: node-1 (bf053fc5-5afd-b483-2ad2-3c9fc354f7fa) - partition with quorum
>> Version: 1.0.9-da7075976b5ff0bee71074385f8fd02f296ec8a3
>> 2 Nodes configured, unknown expected votes
>> 1 Resources configured.
>> ============
>>
>> Online: [ node-1 ]
>> OFFLINE: [ node-0 ]
>>
>> cluster (heartbeat:ha): Started node-1
>>
>>
>> Any help, such as a pointer to the source code where the problem might be, or
>> any existing bug filed for this (I did some searching but didn't find matching
>> symptoms), is appreciated.
>
> This is happening at the heartbeat level.
>
> Not much pacemaker can do I'm afraid. Perhaps look to see if heartbeat is
> "real time" scheduled; if not, that may explain why it's being starved of CPU
> and can't get its messages out.
>
>>
>> Thanks,
>> -Kaiwei
>> _______________________________________________
>> Linux-HA mailing list
>> Linux-HA@lists.linux-ha.org
>> https://urldefense.proofpoint.com/v1/url?u=http://lists.linux-ha.org/mailman/listinfo/linux-ha&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=9IPI1z37RqWr21klX9jnPw%3D%3D%0A&m=wkHclXpXSRj4jN%2FF6yECfGjmY217uMGOFlJmf%2B1H4FA%3D%0A&s=1816bc839d2eb1e28a3d00afaecf7d0ad1eb371fc314b0acf875b0c3e6c9add8
>> See also:
>> https://urldefense.proofpoint.com/v1/url?u=http://linux-ha.org/ReportingProblems&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=9IPI1z37RqWr21klX9jnPw%3D%3D%0A&m=wkHclXpXSRj4jN%2FF6yECfGjmY217uMGOFlJmf%2B1H4FA%3D%0A&s=03ff7c6b602a98e907a05bd086150d61cfaefe6e06fe60ac881fee79077a76f6
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems