Re: [Pacemaker] unable to join cluster
On Thu, Mar 22, 2012 at 3:07 PM, Hisashi Osanai wrote:
>
> Hello,
>
> I have a three-node cluster using pacemaker/corosync. When I reboot one node,
> the node is unable to rejoin the cluster. I see this kind of split brain in
> 10-20% of attempts (reproduction rate) when I shut down a node.
>
> What do you think of this problem?

It depends on whether corosync sees all three nodes (in which case it's a
pacemaker problem); if not, it's a corosync problem.
There are newer versions of both; perhaps try an upgrade?

>
> My questions are:
> - Is this a known problem?
> - Is there any workaround to avoid this?
> - How can I solve this problem?
>
> [testserver001]
>
> Last updated: Sat Mar 10 14:18:49 2012
> Stack: openais
> Current DC: NONE
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
>
> OFFLINE: [ testserver001 testserver002 testserver003 ]
>
> Migration summary:
>
> [testserver002]
>
> Last updated: Sat Mar 10 14:15:17 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
> Resource Group: testgroup
>     testrsc (lsb:testmgr): Started testserver002
> stonith-testserver002 (stonith:external/ipmi): Started testserver003
> stonith-testserver003 (stonith:external/ipmi): Started testserver002
> stonith-testserver001 (stonith:external/ipmi): Started testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> [testserver003]
>
> Last updated: Sat Mar 10 14:19:07 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
> Resource Group: testgroup
>     testrsc (lsb:testmgr): Started testserver002
> stonith-testserver002 (stonith:external/ipmi): Started testserver003
> stonith-testserver003 (stonith:external/ipmi): Started testserver002
> stonith-testserver001 (stonith:external/ipmi): Started testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> - Checked information
>   + https://bugzilla.redhat.com/show_bug.cgi?id=525589
>     It looks like the packages I am using already support this.
>   + http://comments.gmane.org/gmane.linux.highavailability.user/36101
>     I checked the entries in /etc/hosts but did not find a wrong entry.
>     ===
>     127.0.0.1 testserver001 localhost
>     ::1 localhost6.localdomain6 localhost6
>     ===
>
> - Investigation with tcpdump
>   OK case: after MESSAGE_TYPE_ORF_TOKEN is received, pacemaker sends
>   MESSAGE_TYPE_MCAST.
>   I captured this in a VMware environment.
>
>   + MESSAGE_TYPE_ORF_TOKEN
> No.     Time                        Source        Destination   Protocol Length Info
>     119 2012-03-19 22:00:15.250310  172.27.4.1    172.27.4.2    UDP      112    Source port: 23489  Destination port: 23490
>
> Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
> Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92)
> Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2)
> User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
> Data (70 bytes)
>
>      00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00   ..".
> 0010 00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b
>
> (snip)
>
>   + MESSAGE_TYPE_MCAST
> No.     Time                        Source        Destination   Protocol Length Info
>    5141 2012-03-19 22:01:19.198346  172.27.4.2    226.94.16.16  UDP      1486   Source port: 23489  Destination port: 23490
>
> Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits)
> Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
> Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16)
> User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
> Data (1444 bytes)
>
>      01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b   ..".
> 0010 04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b
>
> (snip)
>
>   NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I can
>   see the message in pacemaker.log.
>
>   + MESSAGE_TYPE_ORF_TOKEN
> No.     Time                        Source        Destination   Protocol Length Info
>   39605 2012-03-10 14:18:13.826778  172.2
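
For anyone comparing the captures quoted above, here is a minimal sketch for reading the first bytes of each UDP payload. It rests on assumptions drawn only from the two dumps in this thread, not from the corosync sources: byte 0 is taken as the message type (00 in the ORF token dump, 01 in the mcast dump), bytes 2-3 as a fixed marker (22 ff in both), and bytes 4-7 as the sender's IPv4 address (they match the Source column in both frames). The function name `describe_payload` and the dictionary keys are hypothetical.

```python
import ipaddress
import struct

# Assumption: type values inferred from the two hex dumps in this thread.
MESSAGE_TYPES = {
    0: "MESSAGE_TYPE_ORF_TOKEN",
    1: "MESSAGE_TYPE_MCAST",
}


def describe_payload(hex_bytes: str) -> dict:
    """Decode the first 8 bytes of a payload given as space-separated hex."""
    raw = bytes.fromhex(hex_bytes)
    if len(raw) < 8:
        raise ValueError("need at least 8 payload bytes")

    msg_type = raw[0]
    # Bytes 2-3 read as a little-endian 16-bit value (assumed marker field).
    (marker,) = struct.unpack("<H", raw[2:4])
    # Bytes 4-7 interpreted as a packed IPv4 address (assumed sender id).
    sender = ipaddress.IPv4Address(raw[4:8])

    return {
        "type": MESSAGE_TYPES.get(msg_type, "unknown (%d)" % msg_type),
        "marker": marker,
        "sender": str(sender),
    }


if __name__ == "__main__":
    # First rows of the two OK-case dumps from the original post.
    token_row = "00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00"
    mcast_row = "01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b"
    for row in (token_row, mcast_row):
        print(describe_payload(row))
```

Running the same function over the NG-case payloads would show whether only token-type messages appear there, which is what the original post describes.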
[Pacemaker] unable to join cluster
Hello,

I have a three-node cluster using pacemaker/corosync. When I reboot one node,
the node is unable to rejoin the cluster. I see this kind of split brain in
10-20% of attempts (reproduction rate) when I shut down a node.

What do you think of this problem?

My questions are:
- Is this a known problem?
- Is there any workaround to avoid this?
- How can I solve this problem?

[testserver001]

Last updated: Sat Mar 10 14:18:49 2012
Stack: openais
Current DC: NONE
3 Nodes configured, 3 expected votes
4 Resources configured.

OFFLINE: [ testserver001 testserver002 testserver003 ]

Migration summary:

[testserver002]

Last updated: Sat Mar 10 14:15:17 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.

Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]

Resource Group: testgroup
    testrsc (lsb:testmgr): Started testserver002
stonith-testserver002 (stonith:external/ipmi): Started testserver003
stonith-testserver003 (stonith:external/ipmi): Started testserver002
stonith-testserver001 (stonith:external/ipmi): Started testserver003

Migration summary:
* Node testserver003:
* Node testserver002:

[testserver003]

Last updated: Sat Mar 10 14:19:07 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.

Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]

Resource Group: testgroup
    testrsc (lsb:testmgr): Started testserver002
stonith-testserver002 (stonith:external/ipmi): Started testserver003
stonith-testserver003 (stonith:external/ipmi): Started testserver002
stonith-testserver001 (stonith:external/ipmi): Started testserver003

Migration summary:
* Node testserver003:
* Node testserver002:

- Checked information
  + https://bugzilla.redhat.com/show_bug.cgi?id=525589
    It looks like the packages I am using already support this.
  + http://comments.gmane.org/gmane.linux.highavailability.user/36101
    I checked the entries in /etc/hosts but did not find a wrong entry.
    ===
    127.0.0.1 testserver001 localhost
    ::1 localhost6.localdomain6 localhost6
    ===

- Investigation with tcpdump
  OK case: after MESSAGE_TYPE_ORF_TOKEN is received, pacemaker sends
  MESSAGE_TYPE_MCAST.
  I captured this in a VMware environment.

  + MESSAGE_TYPE_ORF_TOKEN
No.     Time                        Source        Destination   Protocol Length Info
    119 2012-03-19 22:00:15.250310  172.27.4.1    172.27.4.2    UDP      112    Source port: 23489  Destination port: 23490

Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92)
Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2)
User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
Data (70 bytes)

     00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00   ..".
0010 00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b

(snip)

  + MESSAGE_TYPE_MCAST
No.     Time                        Source        Destination   Protocol Length Info
   5141 2012-03-19 22:01:19.198346  172.27.4.2    226.94.16.16  UDP      1486   Source port: 23489  Destination port: 23490

Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits)
Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16)
User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
Data (1444 bytes)

     01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b   ..".
0010 04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b

(snip)

  NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I can
  see the message in pacemaker.log.

  + MESSAGE_TYPE_ORF_TOKEN
No.     Time                        Source        Destination   Protocol Length Info
  39605 2012-03-10 14:18:13.826778  172.27.4.2    172.27.4.3    UDP      112    Source port: 23489  Destination port: 23490

Frame 39605: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
Ethernet II, Src: FujitsuT_98:79:4b (00:19:99:98:79:4b), Dst: FujitsuT_97:8d:15 (00:19:99:97:8d:15)
Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 172.27.4.3 (172.27.4.3)
User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
Data (