Re: [Pacemaker] unable to join cluster

2012-03-28 Thread Andrew Beekhof
On Thu, Mar 22, 2012 at 3:07 PM, Hisashi Osanai wrote:
>
> Hello,
>
> I have a three-node cluster using pacemaker/corosync. When I reboot one node,
> the node is unable to rejoin the cluster. I see this kind of split brain in
> roughly 10-20% of cases (recurrence rate) when I shut down a node.
>
> What do you think of this problem?

It depends on whether corosync sees all three nodes. If it does, it's a
pacemaker problem; if not, it's a corosync problem.
There are newer versions of both; perhaps try an upgrade?
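
The rule of thumb above can be written down as a small triage helper. This is
only a sketch: the function name and the idea of passing in the member list by
hand are assumptions, and how you obtain corosync's membership is left out.

```python
# Node names taken from the crm_mon output quoted in this thread.
EXPECTED_NODES = {"testserver001", "testserver002", "testserver003"}


def suspect_layer(corosync_members, expected=EXPECTED_NODES):
    """Return (layer, missing_nodes) for the layer to look at first.

    If corosync does not list every expected node, start with corosync;
    if it lists all of them, start with pacemaker.
    """
    missing = sorted(set(expected) - set(corosync_members))
    if missing:
        return "corosync", missing
    return "pacemaker", []


if __name__ == "__main__":
    # Hypothetical input: membership as seen from one of the surviving nodes.
    print(suspect_layer({"testserver002", "testserver003"}))
```

The input here is made up for illustration; it is not taken from the
reporter's corosync logs.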

>
> My questions are:
> - Is this a known problem?
> - Is there any workaround to avoid this?
> - How can I solve this problem?
>
> [testserver001]
> 
> Last updated: Sat Mar 10 14:18:49 2012
> Stack: openais
> Current DC: NONE
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> 
>
> OFFLINE: [ testserver001 testserver002 testserver003 ]
>
>
> Migration summary:
>
> [testserver002]
> 
> Last updated: Sat Mar 10 14:15:17 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> 
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
>  Resource Group: testgroup
>     testrsc     (lsb:testmgr):   Started testserver002
> stonith-testserver002        (stonith:external/ipmi):        Started
> testserver003
> stonith-testserver003        (stonith:external/ipmi):        Started
> testserver002
> stonith-testserver001        (stonith:external/ipmi):        Started
> testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> [testserver003]
> 
> Last updated: Sat Mar 10 14:19:07 2012
> Stack: openais
> Current DC: testserver002 - partition with quorum
> Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
> 3 Nodes configured, 3 expected votes
> 4 Resources configured.
> 
>
> Online: [ testserver002 testserver003 ]
> OFFLINE: [ testserver001 ]
>
>  Resource Group: testgroup
>     testrsc     (lsb:testmgr):   Started testserver002
> stonith-testserver002        (stonith:external/ipmi):        Started
> testserver003
> stonith-testserver003        (stonith:external/ipmi):        Started
> testserver002
> stonith-testserver001        (stonith:external/ipmi):        Started
> testserver003
>
> Migration summary:
> * Node testserver003:
> * Node testserver002:
>
> - Checked information
>  + https://bugzilla.redhat.com/show_bug.cgi?id=525589
>    It looks like the packages I am using already include this.
>  + http://comments.gmane.org/gmane.linux.highavailability.user/36101
>    I checked the entries in /etc/hosts but did not find a wrong entry.
>    ===
>    127.0.0.1 testserver001 localhost
>    ::1             localhost6.localdomain6 localhost6
>    ===
>
> - Looked into this with tcpdump
>  OK case: after MESSAGE_TYPE_ORF_TOKEN is received, pacemaker sends
>  MESSAGE_TYPE_MCAST. I took the information from a VMware environment.
>
>    + MESSAGE_TYPE_ORF_TOKEN
>      No.     Time                       Source                Destination
> Protocol Length Info
>          119 2012-03-19 22:00:15.250310 172.27.4.1            172.27.4.2
> UDP      112    Source port: 23489  Destination port: 23490
>
>      Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
>      Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst:
> Vmware_8e:74:92 (00:0c:29:8e:74:92)
>      Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst:
> 172.27.4.2 (172.27.4.2)
>      User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490
> (23490)
>      Data (70 bytes)
>
>        00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00
> ..".
>      0010  00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b
> 
>      (snip)
>
>    + MESSAGE_TYPE_MCAST
>      No.     Time                       Source                Destination
> Protocol Length Info
>         5141 2012-03-19 22:01:19.198346 172.27.4.2            226.94.16.16
> UDP      1486   Source port: 23489  Destination port: 23490
>
>      Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured
> (11888 bits)
>      Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst:
> IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
>      Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst:
> 226.94.16.16 (226.94.16.16)
>      User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490
> (23490)
>      Data (1444 bytes)
>
>        01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b
> ..".
>      0010  04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b
> 
>      (snip)
>
>  NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I can
>           see the message in pacemaker.log.
>
>    + MESSAGE_TYPE_ORF_TOKEN
>      No.     Time                       Source                Destination
> Protocol Length Info
>         39605 2012-03-10 14:18:13.826778 172.2

[Pacemaker] unable to join cluster

2012-03-21 Thread Hisashi Osanai

Hello,

I have a three-node cluster using pacemaker/corosync. When I reboot one node,
the node is unable to rejoin the cluster. I see this kind of split brain in
roughly 10-20% of cases (recurrence rate) when I shut down a node.

What do you think of this problem? 

My questions are:
- Is this a known problem?
- Is there any workaround to avoid this?
- How can I solve this problem?

[testserver001]

Last updated: Sat Mar 10 14:18:49 2012
Stack: openais
Current DC: NONE
3 Nodes configured, 3 expected votes
4 Resources configured.


OFFLINE: [ testserver001 testserver002 testserver003 ]


Migration summary:

[testserver002]

Last updated: Sat Mar 10 14:15:17 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.


Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]

 Resource Group: testgroup
     testrsc     (lsb:testmgr):   Started testserver002
 stonith-testserver002        (stonith:external/ipmi):        Started testserver003
 stonith-testserver003        (stonith:external/ipmi):        Started testserver002
 stonith-testserver001        (stonith:external/ipmi):        Started testserver003

Migration summary:
* Node testserver003:
* Node testserver002:

[testserver003]

Last updated: Sat Mar 10 14:19:07 2012
Stack: openais
Current DC: testserver002 - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
3 Nodes configured, 3 expected votes
4 Resources configured.


Online: [ testserver002 testserver003 ]
OFFLINE: [ testserver001 ]

 Resource Group: testgroup
     testrsc     (lsb:testmgr):   Started testserver002
 stonith-testserver002        (stonith:external/ipmi):        Started testserver003
 stonith-testserver003        (stonith:external/ipmi):        Started testserver002
 stonith-testserver001        (stonith:external/ipmi):        Started testserver003

Migration summary:
* Node testserver003:
* Node testserver002:
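
The three status outputs above can be compared mechanically. The following is
a rough sketch, assuming the "Online: [ ... ]" / "OFFLINE: [ ... ]" line format
shown above; the function names are made up for illustration.

```python
import re

# Matches lines such as "Online: [ testserver002 testserver003 ]".
NODE_LINE = re.compile(r"^[ \t]*(Online|OFFLINE): \[([^\]]*)\]", re.MULTILINE)


def node_view(crm_mon_text):
    """Extract the sets of online and offline nodes from crm_mon text."""
    view = {"online": set(), "offline": set()}
    for state, names in NODE_LINE.findall(crm_mon_text):
        view[state.lower()].update(names.split())
    return view


def views_agree(views):
    """True if every node reports the same online/offline sets."""
    return all(v == views[0] for v in views[1:])


if __name__ == "__main__":
    on_001 = node_view("OFFLINE: [ testserver001 testserver002 testserver003 ]")
    on_002 = node_view(
        "Online: [ testserver002 testserver003 ]\nOFFLINE: [ testserver001 ]"
    )
    print(views_agree([on_001, on_002]))
```

A disagreement between the views, as with testserver001 versus the other two
here, is what the status output shows; the sketch only makes that check
repeatable.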

- Checked information
  + https://bugzilla.redhat.com/show_bug.cgi?id=525589
    It looks like the packages I am using already include this.
  + http://comments.gmane.org/gmane.linux.highavailability.user/36101
    I checked the entries in /etc/hosts but did not find a wrong entry.
    ===
    127.0.0.1 testserver001 localhost
    ::1             localhost6.localdomain6 localhost6
    ===
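
One thing an /etc/hosts check could look for is a node name that resolves to a
loopback address. Treating that as a "wrong entry" is an assumption, not
something confirmed in this thread, but the listing above does place
testserver001 on the 127.0.0.1 line. A minimal sketch (the function name is
hypothetical):

```python
import ipaddress


def loopback_names(hosts_text):
    """Return the host names that a hosts file maps to a loopback address."""
    names = set()
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split columns
        if len(fields) < 2:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first column is not an IP address
        if addr.is_loopback:
            names.update(fields[1:])
    return names


if __name__ == "__main__":
    sample = (
        "127.0.0.1 testserver001 localhost\n"
        "::1             localhost6.localdomain6 localhost6\n"
    )
    print("testserver001" in loopback_names(sample))
```

On a real node the text would come from reading /etc/hosts; whether such an
entry matters depends on how the cluster stack resolves its bind address.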

- Looked into this with tcpdump
  OK case: after MESSAGE_TYPE_ORF_TOKEN is received, pacemaker sends
  MESSAGE_TYPE_MCAST. I took the information from a VMware environment.
  
+ MESSAGE_TYPE_ORF_TOKEN
  No.     Time                       Source                Destination           Protocol Length Info
  119     2012-03-19 22:00:15.250310 172.27.4.1            172.27.4.2            UDP      112    Source port: 23489  Destination port: 23490

  Frame 119: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
  Ethernet II, Src: Vmware_6b:b9:9a (00:0c:29:6b:b9:9a), Dst: Vmware_8e:74:92 (00:0c:29:8e:74:92)
  Internet Protocol Version 4, Src: 172.27.4.1 (172.27.4.1), Dst: 172.27.4.2 (172.27.4.2)
  User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
  Data (70 bytes)

    00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00
..".
  0010  00 00 00 00 00 00 00 00 ac 1b 04 01 02 00 ac 1b

  (snip)

+ MESSAGE_TYPE_MCAST
  No.     Time                       Source                Destination           Protocol Length Info
  5141    2012-03-19 22:01:19.198346 172.27.4.2            226.94.16.16          UDP      1486   Source port: 23489  Destination port: 23490

  Frame 5141: 1486 bytes on wire (11888 bits), 1486 bytes captured (11888 bits)
  Ethernet II, Src: Vmware_8e:74:92 (00:0c:29:8e:74:92), Dst: IPv4mcast_5e:10:10 (01:00:5e:5e:10:10)
  Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 226.94.16.16 (226.94.16.16)
  User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
  Data (1444 bytes)

    01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b
..".
  0010  04 02 08 00 02 00 ac 1b 04 02 08 00 04 00 ac 1b

  (snip)
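
The first bytes of the two payloads above can be read with a few lines of
Python. This is a sketch under an assumed layout: byte 0 is the message type
and bytes 4-7 are the sender's IPv4 address. That reading fits both dumps
(type 00 from 172.27.4.1, type 01 from 172.27.4.2) but is not confirmed
against the corosync sources here.

```python
import ipaddress

# Labels taken only from the two dumps above; other values are not covered.
OBSERVED_TYPES = {0: "MESSAGE_TYPE_ORF_TOKEN", 1: "MESSAGE_TYPE_MCAST"}


def decode_prefix(hex_bytes):
    """Decode the assumed type byte and address from a hex dump row."""
    raw = bytes.fromhex(hex_bytes)  # spaces between byte pairs are skipped
    if len(raw) < 8:
        raise ValueError("need at least 8 bytes")
    return {
        "type": OBSERVED_TYPES.get(raw[0], "unknown (%d)" % raw[0]),
        "address": str(ipaddress.IPv4Address(raw[4:8])),
    }


if __name__ == "__main__":
    print(decode_prefix("00 00 22 ff ac 1b 04 01 00 00 00 00 0c 00 00 00"))
    print(decode_prefix("01 02 22 ff ac 1b 04 02 ac 1b 04 02 02 00 ac 1b"))
```

Applied to a longer capture, the same check would show whether only token
messages are circulating, which is what the NG case describes.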

  NG case: MESSAGE_TYPE_ORF_TOKEN is sent and received repeatedly, and I can
  see the message in pacemaker.log.

+ MESSAGE_TYPE_ORF_TOKEN
  No.     Time                       Source                Destination           Protocol Length Info
  39605   2012-03-10 14:18:13.826778 172.27.4.2            172.27.4.3            UDP      112    Source port: 23489  Destination port: 23490

  Frame 39605: 112 bytes on wire (896 bits), 112 bytes captured (896 bits)
  Ethernet II, Src: FujitsuT_98:79:4b (00:19:99:98:79:4b), Dst: FujitsuT_97:8d:15 (00:19:99:97:8d:15)
  Internet Protocol Version 4, Src: 172.27.4.2 (172.27.4.2), Dst: 172.27.4.3 (172.27.4.3)
  User Datagram Protocol, Src Port: 23489 (23489), Dst Port: 23490 (23490)
  Data (