Yes, the MAC addresses were all updated after the cloning. According to my notes, here are sections of the log files at the time of a fence from each cluster node.
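(For anyone else chasing a cloned-node issue like this: on CentOS 5 the usual place a stale MAC hides is the HWADDR= line in the ifcfg files, so it may be worth cross-checking those against what the kernel actually reports. A rough sketch; eth0/eth1 and the ifcfg path are assumptions for a stock CentOS 5 box, adjust for your NIC names:)

```shell
# Compare the HWADDR pinned in each ifcfg file against the MAC the running
# kernel reports; on a freshly cloned CentOS 5 node these can disagree.
for nic in eth0 eth1; do
    cfg=/etc/sysconfig/network-scripts/ifcfg-$nic
    # MAC recorded in the config file, lower-cased for comparison
    cfg_mac=$(sed -n 's/^HWADDR=//p' "$cfg" 2>/dev/null | tr 'A-F' 'a-f')
    # MAC the kernel reports for the same interface name
    live_mac=$(cat /sys/class/net/$nic/address 2>/dev/null)
    echo "$nic: ifcfg=${cfg_mac:-unset} live=${live_mac:-unknown}"
done
```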
Feb 10 15:17:48 nfs2-cluster clurgmgrd[4280]:<notice> Resource Group Manager Starting
Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice> Shutting down Cluster Service Manager...
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice> Shutting down
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice> Shutting down
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice> Shutdown complete, exiting
Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice> Cluster Service Manager is stopped.
Feb 10 15:18:23 nfs2-cluster ccsd[2989]: Stopping ccsd, SIGTERM received.
Feb 10 15:18:23 nfs2-cluster NAMC
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading all openais components
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_confdb v0 (19/10)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cpg v0 (18/8)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cfg v0 (17/7)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_msg v0 (16/6)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_lck v0 (15/5)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_evt v0 (14/4)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_ckpt v0 (13/3)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_amf v0 (12/2)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_clm v0 (11/1)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_evs v0 (10/0)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cman v0 (9/9)
Feb 10 15:18:23 nfs2-cluster gfs_controld[3077]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster dlm_controld[3071]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster fenced[3065]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 2
Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 1

Feb 10 15:17:34 nfs1-cluster ntpd[3765]: synchronized to LOCAL(0), stratum 10
Feb 10 15:18:17 nfs1-cluster clurgmgrd[4323]:<notice> Member 2 shutting down
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] entering GATHER state from 2.
Feb 10 15:18:34 nfs1-cluster ntpd[3765]: synchronized to 132.236.56.250, stratum 2
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering GATHER state from 0.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Creating commit token because I am the rep.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Saving state aru 230 high seq received 230
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Storing new sequence id for ring 1f80
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering COMMIT state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering RECOVERY state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] position [0] member 140.90.91.240:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] previous ring seq 8060 rep 140.90.91.240
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] aru 230 high delivered 230 received flag 1
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Did not need to originate any messages in recovery.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Sending initial ORF token
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] CLM CONFIGURATION CHANGE
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] New Configuration:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] r(0) ip(140.90.91.240)
Feb 10 15:18:35 nfs1-cluster kernel: dlm: closing connection to node 2
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Left:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] r(0) ip(140.90.91.242)
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Joined:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] CLM CONFIGURATION CHANGE
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] New Configuration:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] r(0) ip(140.90.91.240)
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Left:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Joined:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [SYNC ] This node is within the primary component and will provide service.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering OPERATIONAL state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] got nodejoin message 140.90.91.240
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CPG ] got joinlist message from node 1

I was also seeing a number of these messages, but they stopped after upgrading openais:

nfs2-cluster openais[3012]: [TOTEM] Retransmit List: 1df3

Yes, these are on managed switches. I will try to run the tcpdump ASAP. Unfortunately, that means I have to let it crash again to get what I need, and my users are already annoyed by the downtime we've had. I know this isn't the best solution for our needs, but given the lack of funding, it seemed like a good idea at the time.

Thanks for the help!

Randy

On 02/14/2011 09:03 AM, Digimer wrote:
On 02/14/2011 08:53 AM, Randy Brown wrote:

Hello,

I am running a 2-node cluster being used as a NAS head for a Lefthand Networks iSCSI SAN to provide NFS mounts out to my network. Things have been OK for a while, but I recently lost one of the nodes as a result of a patching problem. In an effort to recreate the failed node, I imaged the working node and installed that image on the failed node. I set its hostname and IP settings correctly, and the machine booted and joined the cluster just fine. Or at least it appeared so. Things ran OK for the last few weeks, but I recently started seeing a behavior where the nodes fence each other. I'm wondering if there is something left over from cloning the nodes that could be the problem. Possibly something that should differ between the nodes but doesn't because of the cloning?

I am running CentOS 5.5 with the following package versions:

Kernel - 2.6.18-194.11.3.el5 #1 SMP
cman-2.0.115-34.el5_5.4
lvm2-cluster-2.02.56-7.el5_5.4
gfs2-utils-0.1.62-20.el5
kmod-gfs-0.1.34-12.el5.centos
rgmanager-2.0.52-6.el5.centos.8

I have a QLogic qla4062 HBA in the node running: QLogic iSCSI HBA Driver (f8b83000) v5.01.03.04

I will gladly provide more information as needed.

Thank you,
Randy

Silly question, but are the NICs mapped to their MAC addresses? If so, did you update the MAC addresses after cloning the server to reflect the actual hardware?

Assuming so, do you have managed switches? If so, can you test by swapping in a simple, unmanaged switch? This sounds like a multicast issue at some level.

Fencing happens once the totem ring is declared failed. Do you see anything interesting in the log files prior to the fence? Can you run tcpdump to see what is happening on the interface(s) prior to the fence?
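(For the tcpdump step, capturing just the totem traffic plus IGMP is usually enough to see whether the switch is dropping the multicast group. A sketch only: eth0 as the cluster-facing NIC and port 5405, the default totem mcastport, are assumptions, so check cluster.conf for the real values. The command is printed rather than executed so it can be reviewed before running it as root:)

```shell
# Assemble a capture command for totem (openais) traffic and IGMP; IGMP
# shows whether multicast group membership is being dropped by the switch.
IFACE=eth0                          # assumption: the NIC carrying cluster traffic
PCAP=/var/tmp/totem-$(hostname -s).pcap
FILTER='udp port 5405 or igmp'      # 5405 is the default totem mcastport
echo "tcpdump -i $IFACE -n -s 0 -w $PCAP $FILTER"
# Run the printed command as root on BOTH nodes, leave it running until the
# next fence, then compare the two captures around the time of the event.
```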
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster