[Linux-cluster] [cman] can't join cluster after reboot
Hi,

I'm trying to set up a 3-node cluster (2 nodes + 1 standby node for quorum) with the cman+pacemaker stack, following this quickstart article: http://clusterlabs.org/quickstart-redhat.html

The cluster starts, all nodes see each other, quorum is gained, and stonith works, but I've run into a problem with cman: a node can't join the cluster after a reboot. cman starts, but "cman_tool nodes" reports only that node as a cluster member, while on the other 2 nodes it reports those 2 nodes as cluster members and the 3rd as offline. cman stop/start/restart on the problem node has no effect - it still sees only itself. But if I restart cman on one of the working nodes, everything goes back to normal: all 3 nodes join the cluster, and subsequent cman service restarts on any node work fine - the node leaves the cluster and rejoins successfully. But again - only until the node's OS is rebooted.

For example:

[1] Working cluster:

[root@node-1 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    592   2013-11-07 15:20:54  node-1.spb.stone.local
   2   M    760   2013-11-07 15:20:54  node-2.spb.stone.local
   3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local

[root@node-1 ~]# cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: ocluster
Cluster Id: 2059
Cluster Member: Yes
Cluster Generation: 760
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0
Node name: node-1.spb.stone.local
Node ID: 1
Multicast addresses: 239.192.8.19
Node addresses: 192.168.220.21

The picture is the same on all 3 nodes (apart from node name and ID) - same cluster name, cluster ID, multicast address.

[2] I rebooted node-1. After the reboot completed, "cman_tool nodes" on node-2 and vnode-3 shows this:

Node  Sts   Inc   Joined               Name
   1   X    760                        node-1.spb.stone.local
   2   M    588   2013-11-07 15:11:23  node-2.spb.stone.local
   3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local

[root@node-2 ~]# cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: ocluster
Cluster Id: 2059
Cluster Member: Yes
Cluster Generation: 764
Membership state: Cluster-Member
Nodes: 2
Expected votes: 3
Total votes: 2
Node votes: 1
Quorum: 2
Active subsystems: 7
Flags:
Ports Bound: 0
Node name: node-2.spb.stone.local
Node ID: 2
Multicast addresses: 239.192.8.19
Node addresses: 192.168.220.22

But on the rebooted node-1 it shows this:

Node  Sts   Inc   Joined               Name
   1   M    764   2013-11-07 15:49:01  node-1.spb.stone.local
   2   X      0                        node-2.spb.stone.local
   3   X      0                        vnode-3.spb.stone.local

[root@node-1 ~]# cman_tool status
Version: 6.2.0
Config Version: 10
Cluster Name: ocluster
Cluster Id: 2059
Cluster Member: Yes
Cluster Generation: 776
Membership state: Cluster-Member
Nodes: 1
Expected votes: 3
Total votes: 1
Node votes: 1
Quorum: 2 Activity blocked
Active subsystems: 7
Flags:
Ports Bound: 0
Node name: node-1.spb.stone.local
Node ID: 1
Multicast addresses: 239.192.8.19
Node addresses: 192.168.220.21

So: same cluster name, cluster ID, multicast address - but it can't see the other nodes. And there is nothing in /var/log/messages or /var/log/cluster/corosync.log on the other two nodes - they don't seem to notice node-1 coming back online at all; the last records are about node-1 leaving the cluster.
[3] If I now do "service cman restart" on node-2 or vnode-3, everything goes back to normal operation as in [1]. In the logs it shows as node-2 leaving the cluster (service stop) and then node-1 and node-2 joining simultaneously (service start):

Nov  7 11:47:06 vnode-3 corosync[26692]: [QUORUM] Members[2]: 2 3
Nov  7 11:47:06 vnode-3 corosync[26692]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov  7 11:47:06 vnode-3 kernel: dlm: closing connection to node 1
Nov  7 11:47:06 vnode-3 corosync[26692]: [CPG   ] chosen downlist: sender r(0) ip(192.168.220.22) ; members(old:3 left:1)
Nov  7 11:47:06 vnode-3 corosync[26692]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov  7 11:53:28 vnode-3 corosync[26692]: [QUORUM] Members[1]: 3
Nov  7 11:53:28 vnode-3 corosync[26692]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov  7 11:53:28 vnode-3 corosync[26692]: [CPG   ] chosen downlist: sender r(0) ip(192.168.220.14) ; members(old:2 left:1)
Nov  7 11:53:28 vnode-3 corosync[26692]: [MAIN  ] Completed service synchronization, ready to provide service.
Nov  7 11:53:28 vnode-3 kernel: dlm: closing connection to node 2
Nov  7 11:53:30 vnode-3 corosync[26692]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov  7 11:53:30 vnode-3 corosync[26692]: [QUORUM] Members[2]: 1 3
Nov  7 11:53:30 vnode-3 corosync[26692]: [QUORUM] Members[2]: 1 3
Nov  7 11:53:30 vnode-3 corosync[26692]: [QUORUM] Members[3]: 1 2 3
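For reference, the workaround from [3] as a literal command sequence (a sketch using only commands already shown in this thread; run the restart on one of the still-working nodes):

  # on a healthy node (node-2 or vnode-3) - this lets the rebooted node rejoin:
  service cman restart

  # then verify membership from any node:
  cman_tool nodes     # all 3 nodes should show Sts "M" again
  cman_tool status    # "Nodes:" and "Total votes:" should both be 3 again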
Re: [Linux-cluster] [cman] can't join cluster after reboot
My understanding is that the node is fenced while rebooting. I suggest you look into the fencing logs as well. If your fencing logs are not detailed enough, use the following in cluster.conf to enable debug logging for fenced:

  <logging>
    <logging_daemon name="fenced" debug="on"/>
  </logging>
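For context, the <logging> element belongs directly under the top-level <cluster> element of /etc/cluster/cluster.conf; a minimal sketch, reusing the cluster name and config version reported in this thread (the bump of config_version to 11 is an assumption - it must be incremented whenever the file changes):

  <cluster name="ocluster" config_version="11">
    <logging>
      <logging_daemon name="fenced" debug="on"/>
    </logging>
    ...
  </cluster>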
Thanks

On Thu, Nov 7, 2013 at 5:34 PM, Yuriy Demchenko <demchenko...@gmail.com> wrote:
> [original message quoted in full; trimmed]
Re: [Linux-cluster] [cman] can't join cluster after reboot
Nope, nothing in the logs suggests that the node is fenced during the reboot. Moreover, the same behaviour persists with pacemaker started - and I had explicitly put the node into standby in pacemaker before the reboot. The same behaviour persists with stonith-enabled=false, and the same with a manual node fence via "stonith_admin --reboot node-1.spb.stone.local". So I suppose fencing isn't the issue here.
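For reference, the checks described above as commands (the stonith_admin call is quoted verbatim from this message; the standby and stonith-enabled steps are expressed via crm_attribute as an assumption, since the message does not say which tool was used):

  # put the node into standby before rebooting it:
  crm_attribute --node node-1.spb.stone.local --name standby --update on

  # temporarily disable stonith cluster-wide:
  crm_attribute --name stonith-enabled --update false

  # manual fence test:
  stonith_admin --reboot node-1.spb.stone.local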
Yuriy Demchenko

On 11/07/2013 05:11 PM, Vishesh kumar wrote:
> [earlier reply and quoted original message trimmed]
Re: [Linux-cluster] [cman] can't join cluster after reboot
On 07/11/13 12:04, Yuriy Demchenko wrote:
> [original message quoted in full; trimmed]
Re: [Linux-cluster] Adding a node back to cluster failing
On 07/11/13 13:35 +0800, Zama Ques wrote:
>> A. "The host %s is already a member of cluster %s"
>> or
>> B. "%s is already a member of a cluster named %s"
>> or some other message (interpolate %s above with your values)?
>>
>>> But the node name is not there in the cluster.conf file.
>>
>> You are talking about the original surviving node now, right? It seems
>> more likely to me that B. above is right, so please make sure there is
>> no /etc/cluster/cluster.conf on the node you are trying to add (it may
>> be a leftover from the original setup of this recovered node).
>
> Thanks Jan, you assumed correctly. Deleting /etc/cluster/cluster.conf
> actually resolved the issue. The node is successfully added back to the
> cluster.

Great to hear that :)

The problem there was that we distinguish two situations in luci:

a. The node is associated with the cluster (as per a clusternode entry in cluster.conf), but is not an active member of the cluster (because the cman service is not running there, e.g. when the node has been the subject of a "leave cluster" action in luci).
- Such a node is explicitly listed amongst the cluster nodes with a "not a cluster member" status, and from there it can be selected for a "join cluster" action, which will start cman + rgmanager on that node again, leading to cluster membership (if nothing goes wrong).
- In this case the node is expected (enforced) to carry cluster.conf, which is also kept updated throughout changes in the cluster configuration, as long as ricci runs there and cluster.conf is not deleted manually in between.

b. The node is not a priori associated with the cluster (it is not mentioned in cluster.conf anywhere across the cluster).
- Such a node is, accordingly, not listed amongst the cluster nodes (there is no hint of it there) and can be added via the "add" action from that view.
- In this case the node is expected *not* to contain cluster.conf upon being added to an existing cluster, simply because otherwise an attempt could be made, by mistake, to add a node that is already a member of a different cluster, possibly leading to inconsistencies in the luci view of the cluster.

NB: we might tighten this constraint and allow the cluster configuration to be already present on the node being added, provided that the cluster name matches the destination (which would help in this very case, IMHO).

Apparently this was case b., and the above description explains why removing cluster.conf from the node to be added helped. It's up for consideration whether to provide a solution to such a class of cases as suggested; feel free to comment on https://bugzilla.redhat.com/show_bug.cgi?id=1028092 (note: I mentioned some other little discrepancies I've discovered there).

-- 
Jan
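For reference, the resolution that worked here, as run on the node being added back (Jan's case b.; a one-line sketch - make sure the node really should not carry a cluster configuration before deleting it):

  # on the node being re-added through luci:
  rm -f /etc/cluster/cluster.conf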