[Linux-cluster] [cman] can't join cluster after reboot

2013-11-07 Thread Yuriy Demchenko

Hi,

I'm trying to set up a 3-node cluster (2 nodes + 1 standby node for 
quorum) with the cman+pacemaker stack, everything according to this 
quickstart article: http://clusterlabs.org/quickstart-redhat.html
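
For context, a minimal cluster.conf for this kind of setup (following that
quickstart) would look roughly like the sketch below. The node names, ids,
cluster name and config version are taken from the cman_tool output further
down; everything else is an assumption rather than the actual config:

<?xml version="1.0"?>
<cluster config_version="10" name="ocluster">
  <clusternodes>
    <clusternode name="node-1.spb.stone.local" nodeid="1"/>
    <clusternode name="node-2.spb.stone.local" nodeid="2"/>
    <clusternode name="vnode-3.spb.stone.local" nodeid="3"/>
  </clusternodes>
  <!-- the quickstart also wires each node's fence method to the
       fence_pcmk agent, so that cman defers fencing to pacemaker -->
</cluster>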


The cluster starts, all nodes see each other, quorum is gained, and stonith 
is working, but I've run into a problem with cman: a node can't join the 
cluster after a reboot. cman starts, and cman_tool nodes reports only that 
node as a cluster member, while the other 2 nodes report 2 nodes as cluster 
members and the 3rd as offline. cman stop/start/restart on the problem node 
has no effect: it still sees only itself. But if I restart cman on one of 
the working nodes, everything goes back to normal: all 3 nodes join the 
cluster, and subsequent cman service restarts on any node work fine; the 
node leaves the cluster and rejoins successfully. But again, only until the 
node's OS reboots.


For example:
[1] Working cluster:

[root@node-1 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    592   2013-11-07 15:20:54  node-1.spb.stone.local
   2   M    760   2013-11-07 15:20:54  node-2.spb.stone.local
   3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
[root@node-1 ~]# cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: ocluster
Cluster Id: 2059
Cluster Member: Yes
Cluster Generation: 760
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0
Node name: node-1.spb.stone.local
Node ID: 1
Multicast addresses: 239.192.8.19
Node addresses: 192.168.220.21
The picture is the same on all 3 nodes (except for node name and id): same 
cluster name, cluster id, and multicast address.
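
A quick way to compare these across nodes (a sketch, assuming passwordless
ssh between the nodes) is something like:

for n in node-1 node-2 vnode-3; do
  ssh "$n.spb.stone.local" "cman_tool status | grep -E 'Cluster Name|Cluster Id|Multicast'"
done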


[2] I've rebooted node-1. After the reboot completes, cman_tool 
nodes on node-2 and vnode-3 shows this:

Node  Sts   Inc   Joined               Name
   1   X    760                        node-1.spb.stone.local
   2   M    588   2013-11-07 15:11:23  node-2.spb.stone.local
   3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
[root@node-2 ~]# cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: ocluster
Cluster Id: 2059
Cluster Member: Yes
Cluster Generation: 764
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0
Node name: node-2.spb.stone.local
Node ID: 2
Multicast addresses: 239.192.8.19
Node addresses: 192.168.220.22

But on the rebooted node-1 it shows this:

Node  Sts   Inc   Joined               Name
   1   M    764   2013-11-07 15:49:01  node-1.spb.stone.local
   2   X      0                        node-2.spb.stone.local
   3   X      0                        vnode-3.spb.stone.local
[root@node-1 ~]# cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: ocluster
Cluster Id: 2059
Cluster Member: Yes
Cluster Generation: 776
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 7
Flags:
Ports Bound: 0
Node name: node-1.spb.stone.local
Node ID: 1
Multicast addresses: 239.192.8.19
Node addresses: 192.168.220.21
So, same cluster name, cluster id, and multicast address, but it can't see 
the other nodes. And there is nothing in /var/log/messages or 
/var/log/cluster/corosync.log on the other two nodes: they don't seem to 
notice node-1 coming back online at all; the last records are about node-1 
leaving the cluster.
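
As a side note on the quorum arithmetic here: with expected_votes = 3 the
threshold cman computes is

    quorum = floor(expected_votes / 2) + 1 = floor(3 / 2) + 1 = 2

and the rebooted node holds only its own single vote, which is exactly why
cman_tool prints "Quorum: 2 Activity blocked" above.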


[3] If I now do service cman restart on node-2 or vnode-3, everything 
goes back to normal operation as in [1].
In the logs it shows as node-2 leaving the cluster (service stop) and 
then both node-2 and node-1 joining simultaneously (service start):

Nov  7 11:47:06 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 2 3
Nov  7 11:47:06 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov  7 11:47:06 vnode-3 kernel: dlm: closing connection to node 1
Nov  7 11:47:06 vnode-3 corosync[26692]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.220.22) ; members(old:3 left:1)
Nov  7 11:47:06 vnode-3 corosync[26692]:   [MAIN  ] Completed service synchronization, ready to provide service.

Nov  7 11:53:28 vnode-3 corosync[26692]:   [QUORUM] Members[1]: 3
Nov  7 11:53:28 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov  7 11:53:28 vnode-3 corosync[26692]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.220.14) ; members(old:2 left:1)
Nov  7 11:53:28 vnode-3 corosync[26692]:   [MAIN  ] Completed service synchronization, ready to provide service.

Nov  7 11:53:28 vnode-3 kernel: dlm: closing connection to node 2
Nov  7 11:53:30 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3

Re: [Linux-cluster] [cman] can't join cluster after reboot

2013-11-07 Thread Vishesh kumar
My understanding is that the node is fenced while rebooting. I suggest you
look into the fencing logs as well. If your fencing logs are not detailed
enough, use the following in cluster.conf to enable debug logging:

<logging>
  <logging_daemon name="fenced" debug="on"/>
</logging>
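
To check whether a fence actually fired, something along these lines should
do (an assumption based on the stock log locations of the RHEL 6 era cluster
stack; adjust the paths if your install differs):

# fenced writes its own log under /var/log/cluster when debug is on
grep -i fence /var/log/cluster/fenced.log
# fence events are usually echoed to syslog as well
grep -i fence /var/log/messages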


Thanks


On Thu, Nov 7, 2013 at 5:34 PM, Yuriy Demchenko demchenko...@gmail.com wrote:

 [snip]

Re: [Linux-cluster] [cman] can't join cluster after reboot

2013-11-07 Thread Yuriy Demchenko
Nope, nothing in the logs suggests that the node is fenced while rebooting. 
Moreover, the same behaviour persists with pacemaker started, and I've 
explicitly put the node into standby in pacemaker before the reboot.
The same behaviour persists with stonith-enabled=false, and also with a 
manual node fence via stonith_admin --reboot 
node-1.spb.stone.local. So I suppose fencing isn't the issue here.
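
For reference, putting a node into standby before rebooting it can be done
roughly like this (a sketch using the pcs tooling from the quickstart;
crm_standby is the lower-level equivalent):

# mark node-1 as unable to run resources, then reboot it
pcs cluster standby node-1.spb.stone.local
reboot
# ...after it comes back and (ideally) rejoins:
pcs cluster unstandby node-1.spb.stone.local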


Yuriy Demchenko

On 11/07/2013 05:11 PM, Vishesh kumar wrote:
 [snip]

Re: [Linux-cluster] [cman] can't join cluster after reboot

2013-11-07 Thread Christine Caulfield

On 07/11/13 12:04, Yuriy Demchenko wrote:

 [snip]

Re: [Linux-cluster] Adding a node back to cluster failing

2013-11-07 Thread Jan Pokorný
On 07/11/13 13:35 +0800, Zama Ques wrote:
 A. The host %s is already a member of cluster %s
 
 or
 
 B. %s is already a member of a cluster named %s
 
 or some other message (interpolate %s above with your values)?
 
 
 But the node name is not there in the cluster.conf file.
 
 you are talking about the original surviving node now, right?
 
 It seems more likely to me that B. from above is right, so then,
 please make sure there is no /etc/cluster/cluster.conf on the node you
 are trying to add (may be a leftover from the original setup of this
 recovered node).
 
 Thanks Jan, you assumed correctly.  
 
 Deleting /etc/cluster/cluster.conf actually resolved the issue.
 The node is successfully added back to the cluster.

Great to hear that :)


The problem there was that we distinguish two situations in luci:

a. the node is associated with the cluster (as per its clusternode entry
   in cluster.conf) but is not an active member of the cluster
   (due to the cman service not running there, e.g. when the node
   has been the subject of a leave cluster action in luci)

   - this node is explicitly listed amongst the cluster nodes
     with a not a cluster member status, and from here it
     can be selected for the join cluster action, which will
     start cman + rgmanager on that node again, leading
     to cluster membership (if nothing goes wrong)

   - in this case the node is expected (enforced) to carry
     cluster.conf (which is also updated throughout changes
     in the cluster configuration, as long as ricci runs there
     and cluster.conf is not deleted manually in between)

b. the node is not a priori associated with the cluster (it's not mentioned
   in the cluster.conf across the cluster)

   - this node is, accordingly, not listed amongst the cluster nodes
     at all, and can be added via the add action from that view

   - in this case the node is expected *not* to contain cluster.conf
     upon being added to an existing cluster, simply because, by
     mistake, an attempt could be made to add a node that is already
     a member of a different cluster, possibly leading to
     inconsistencies in the luci view of the cluster
     NB: we might relax this constraint and allow the cluster
     configuration to be already present on the node being
     added, provided that the cluster name matches the destination
     (which would help in this very case, IMHO)


Apparently this was a case of b., and the above description explains
why removing cluster.conf from the node to be added helped.
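
For anyone hitting the same situation, the cleanup on the node being
re-added boils down to something like this (a sketch; the backup path is
arbitrary):

# on the node about to be (re-)added via luci
service cman stop                                    # ensure cluster services are down
mv /etc/cluster/cluster.conf /root/cluster.conf.bak  # move the stale config aside
# then add the node through luci; ricci will deliver a fresh cluster.conf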

It's open for consideration whether to provide a solution to this class
of cases as suggested; feel free to comment on:

https://bugzilla.redhat.com/show_bug.cgi?id=1028092

note: I mentioned some other little discrepancies I've discovered
  there

-- 
Jan

-- 
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster