thijn wrote:
> Hi,
>
> I have the following problem.
> CMAN: removing node [server1] from the cluster : Missed too many
> heartbeats
> When the server comes back up:
> Feb 10 14:43:58 server1 kernel: CMAN: sending membership request
> after which it will try to join until the end of time.
>
> In the current problem, server2 is active and server1 is unable to
> join the cluster.
>
> The setup is a two-server cluster.
> We have had the problem on several clusters.
> We usually "fixed" it by rebooting the other node, at which point the
> cluster would repair itself and everything ran smoothly from then on.
> Naturally this disrupts any services running on the cluster, and it's
> not really a solution that will win prizes.
> The problem is that server1 (the problem one) is in an inquorate state
> and we are unable to get it to a quorate state, nor do we see why this
> is the case.
> We tried to reproduce the problem on a test setup, but were unable to.
>
> So we decided to try to find a way to fix the state of the cluster using
> the tools the system provides.
>
> The problem presents itself after a fence action by either node.
> When we bring down both nodes to stabilize the issue, the cluster
> becomes healthy, and after that we can reboot either node and it
> will rejoin the cluster.
> It seems the problem presents itself when "pulling the plug" out of the
> server.
> We run on IBM Xservers using the SA-adapter as a fence device.
> The fence device is in a different subnet than the subnet on which the
> cluster communicates.
> Both fence devices are on the same subnet/vlan.
>
> CentOS release 4.6 (Final)
> Linux server2 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:48:03 EST 2007 i686 i686 i386 GNU/Linux
> cman_tool 1.0.17 (built Mar 20 2007 17:10:52)
> Copyright (C) Red Hat, Inc. 2004 All rights reserved.
>
> All versions of libraries, packages, kernel modules and everything else
> the GFS cluster depends on are identical on both nodes.
>
> Cluster.conf
> [r...@server1 log]# cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster config_version="3" name="NAME_cluster">
>   <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>   <clusternodes>
>     <clusternode name="server1.production.loc" votes="1">
>       <fence>
>         <method name="1">
>           <device name="saserver1"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="server2.production.loc" votes="1">
>       <fence>
>         <method name="1">
>           <device name="saserver2"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <cman expected_votes="1" two_node="1"/>
>   <fencedevices>
>     <fencedevice agent="fence_rsa" ipaddr="10.13.110.114" login="saadapter" name="saserver1" passwd="XXXXXXX"/>
>     <fencedevice agent="fence_rsa" ipaddr="10.13.110.115" login="saadapter" name="saserver2" passwd="XXXXXXX"/>
>   </fencedevices>
>   <rm>
>     <failoverdomains/>
>     <resources/>
>   </rm>
> </cluster>
>
> [r...@server1 log]# cat /etc/hosts
> 127.0.0.1 localhost.localdomain localhost
>
> Both servers are able to ping each other and also the broadcast address,
> so there is no firewall filtering UDP packets.
> When I tcpdump the line I see traffic going both ways.
>
> Both servers are in the same vlan
> 14:51:28.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 56) server2.production.loc.6809 > broadcast.production.loc.6809: UDP, length 28
> 14:51:28.703277 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 140) server1.production.loc.6809 > server2.production.loc.6809: UDP, length 112
> 14:51:33.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 56) server2.production.loc.6809 > broadcast.production.loc.6809: UDP, length 28
> 14:51:33.703310 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 140) server1.production.loc.6809 > server2.production.loc.6809: UDP, length 112
>
> Is this normal network behavior when a cluster is inquorate?
> I see that server1 is talking to server2, but server2 is only talking in
> broadcasts.
>
> When I start or try to join the cluster
> Feb 10 09:36:06 server1 cman: cman_tool: Node is already active failed
>
> [r...@server1 ~]# cman_tool status
> Protocol version: 5.0.1
> Config version: 3
> Cluster name: NAME_cluster
> Cluster ID: 64692
> Cluster Member: No
> Membership state: Joining
>
> [r...@server2 log]# cman_tool status
> Protocol version: 5.0.1
> Config version: 3
> Cluster name: RWSEems_cluster
> Cluster ID: 64692
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 1
> Expected_votes: 1
> Total_votes: 1
> Quorum: 1
> Active subsystems: 7
> Node name: server2.production.loc
> Node ID: 2
> Node addresses: server1.production.loc
>
> [r...@server1 ~]# cman_tool nodes
> Node  Votes Exp Sts  Name
>
> [r...@server2 log]# cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    1   X   server1.production.loc
>    2    1    1   M   server2.production.loc
>
> When I start cman
> service cman start
>
> Feb 10 14:06:30 server1 kernel: CMAN: Waiting to join or form a Linux-cluster
> Feb 10 14:06:30 server1 ccsd[21964]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4
> Feb 10 14:06:30 server1 ccsd[21964]: Initial status:: Inquorate
>
> It seems to me that this should be fixable with the tools provided
> with the Red Hat Cluster Suite, without disturbing the running cluster.
> It seems quite insane if I need to restart my cluster to have it all
> working again.. kinda spoils the idea of running a cluster.
> This setup is running in an HA environment and we can afford next to no
> downtime.
>
> The logs on the healthy server (server2) do not mention any errors
> when rebooting, restarting cman, or when server1 wants to join the
> cluster.
> We see no disallowed, refused or anything else suggesting that server2
> is unwilling to play with server1.
>
> I have been looking at this thing for a while now.. am I missing
> anything?
>
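As an aside, the state described in the quoted post can be checked mechanically. The following is only a sketch in Python, not part of the cluster suite: it assumes the `cman_tool nodes` column layout and the cluster.conf structure shown in the quoted post, and the function names (`non_members`, `two_node_settings`) are made up for illustration. It does not fix the join problem; it only reports which nodes are not listed as members and reads back the two-node quorum settings.

```python
import xml.etree.ElementTree as ET

# Sample inputs copied from the quoted post (hostnames as anonymized there).
NODES_OUTPUT = """\
Node  Votes Exp Sts  Name
   1    1    1   X   server1.production.loc
   2    1    1   M   server2.production.loc
"""

# Trimmed copy of the posted cluster.conf; fence sections omitted for brevity.
CLUSTER_CONF = """\
<cluster config_version="3" name="NAME_cluster">
  <clusternodes>
    <clusternode name="server1.production.loc" votes="1"/>
    <clusternode name="server2.production.loc" votes="1"/>
  </clusternodes>
  <cman expected_votes="1" two_node="1"/>
</cluster>
"""


def non_members(nodes_output):
    """Return names of nodes whose Sts column is not 'M'.

    Assumes the five-column layout shown in the post; the meaning of
    other status letters is not interpreted here.
    """
    names = []
    for line in nodes_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) == 5 and fields[3] != "M":
            names.append(fields[4])
    return names


def two_node_settings(conf_xml):
    """Return (node_count, two_node, expected_votes) from cluster.conf text."""
    root = ET.fromstring(conf_xml)
    cman = root.find("cman")
    node_count = len(root.findall("./clusternodes/clusternode"))
    return node_count, cman.get("two_node"), cman.get("expected_votes")


if __name__ == "__main__":
    print("not members:", non_members(NODES_OUTPUT))
    print("two-node settings:", two_node_settings(CLUSTER_CONF))
```

In practice the sample strings would be replaced by the live output of `cman_tool nodes` on the healthy node and the contents of /etc/cluster/cluster.conf.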
This is a known bug, see
https://bugzilla.redhat.com/show_bug.cgi?id=475293

It's fixed in 4.7, or you can run a program to set up a workaround.

Having said that, I have heard reports of it still happening in some
circumstances ... but I don't have any more detail.

--
Chrissie

--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster