Yes, the MAC addresses were all updated after the cloning. According to my notes, here are sections of the log files at the time of a fence from each cluster node.
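(For anyone else chasing a cloned-node issue like this: on CentOS 5 the usual place a stale MAC hides is the HWADDR= line in the ifcfg files, so it may be worth cross-checking those against what the kernel actually reports. A rough sketch; eth0/eth1 and the ifcfg path are assumptions for a stock CentOS 5 box, adjust for your NIC names:)

```shell
# Compare the HWADDR pinned in each ifcfg file against the MAC the running
# kernel reports; on a freshly cloned CentOS 5 node these can disagree.
for nic in eth0 eth1; do
    cfg=/etc/sysconfig/network-scripts/ifcfg-$nic
    # MAC recorded in the config file, lower-cased for comparison
    cfg_mac=$(sed -n 's/^HWADDR=//p' "$cfg" 2>/dev/null | tr 'A-F' 'a-f')
    # MAC the kernel reports for the same interface name
    live_mac=$(cat /sys/class/net/$nic/address 2>/dev/null)
    echo "$nic: ifcfg=${cfg_mac:-unset} live=${live_mac:-unknown}"
done
```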
Feb 10 15:17:48 nfs2-cluster clurgmgrd[4280]:<notice> Resource Group Manager Starting
Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice> Shutting down Cluster Service Manager...
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice> Shutting down
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice> Shutting down
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice> Shutdown complete, exiting
Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice> Cluster Service Manager is stopped.
Feb 10 15:18:23 nfs2-cluster ccsd[2989]: Stopping ccsd, SIGTERM received.
Feb 10 15:18:23 nfs2-cluster NAMC
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading all openais components
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_confdb v0 (19/10)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cpg v0 (18/8)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cfg v0 (17/7)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_msg v0 (16/6)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_lck v0 (15/5)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_evt v0 (14/4)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_ckpt v0 (13/3)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_amf v0 (12/2)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_clm v0 (11/1)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_evs v0 (10/0)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cman v0 (9/9)
Feb 10 15:18:23 nfs2-cluster gfs_controld[3077]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster dlm_controld[3071]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster fenced[3065]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 2
Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 1

Feb 10 15:17:34 nfs1-cluster ntpd[3765]: synchronized to LOCAL(0), stratum 10
Feb 10 15:18:17 nfs1-cluster clurgmgrd[4323]:<notice> Member 2 shutting down
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] entering GATHER state from 2.
Feb 10 15:18:34 nfs1-cluster ntpd[3765]: synchronized to 132.236.56.250, stratum 2
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering GATHER state from 0.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Creating commit token because I am the rep.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Saving state aru 230 high seq received 230
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Storing new sequence id for ring 1f80
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering COMMIT state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering RECOVERY state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] position [0] member 140.90.91.240:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] previous ring seq 8060 rep 140.90.91.240
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] aru 230 high delivered 230 received flag 1
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Did not need to originate any messages in recovery.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Sending initial ORF token
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] CLM CONFIGURATION CHANGE
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] New Configuration:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] r(0) ip(140.90.91.240)
Feb 10 15:18:35 nfs1-cluster kernel: dlm: closing connection to node 2
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Left:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] r(0) ip(140.90.91.242)
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Joined:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] CLM CONFIGURATION CHANGE
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] New Configuration:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] r(0) ip(140.90.91.240)
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Left:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Joined:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [SYNC ] This node is within the primary component and will provide service.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering OPERATIONAL state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] got nodejoin message 140.90.91.240
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CPG ] got joinlist message from node 1

I was also seeing a number of these messages, but they stopped after upgrading openais:

nfs2-cluster openais[3012]: [TOTEM] Retransmit List: 1df3

Yes, these are on managed switches. I will try to run the tcpdump ASAP. Unfortunately, that means I have to let it crash again to get what I need, and my users are already annoyed by the downtime we've had. I know this isn't the best solution for our needs, but given the lack of funding, it seemed like a good idea at the time.

Thanks for the help!

Randy

On 02/14/2011 09:03 AM, Digimer wrote:
On 02/14/2011 08:53 AM, Randy Brown wrote:

Hello,

I am running a 2-node cluster being used as a NAS head for a Lefthand Networks iSCSI SAN to provide NFS mounts out to my network. Things have been OK for a while, but I recently lost one of the nodes as a result of a patching problem. In an effort to recreate the failed node, I imaged the working node and installed that image on the failed node. I set its hostname and IP settings correctly, and the machine booted and joined the cluster just fine. Or at least it appeared so. Things ran OK for the last few weeks, but I recently started seeing a behavior where the nodes fence each other. I'm wondering if there is something left over from cloning the nodes that could be the problem. Possibly something that should differ between the nodes but doesn't because of the cloning?

I am running CentOS 5.5 with the following package versions:

Kernel - 2.6.18-194.11.3.el5 #1 SMP
cman-2.0.115-34.el5_5.4
lvm2-cluster-2.02.56-7.el5_5.4
gfs2-utils-0.1.62-20.el5
kmod-gfs-0.1.34-12.el5.centos
rgmanager-2.0.52-6.el5.centos.8

I have a QLogic qla4062 HBA in the node running: QLogic iSCSI HBA Driver (f8b83000) v5.01.03.04

I will gladly provide more information as needed.

Thank you,
Randy

Silly question, but are the NICs mapped to their MAC addresses? If so, did you update the MAC addresses after cloning the server to reflect the actual hardware?

Assuming so, do you have managed switches? If so, can you test by swapping in a simple, unmanaged switch? This sounds like a multicast issue at some level.

Fencing happens once the totem ring is declared failed. Do you see anything interesting in the log files prior to the fence? Can you run tcpdump to see what is happening on the interface(s) prior to the fence?
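(For the tcpdump step, capturing just the totem traffic plus IGMP is usually enough to see whether the switch is dropping the multicast group. A sketch only: eth0 as the cluster-facing NIC and port 5405, the default totem mcastport, are assumptions, so check cluster.conf for the real values. The command is printed rather than executed so it can be reviewed before running it as root:)

```shell
# Assemble a capture command for totem (openais) traffic and IGMP; IGMP
# shows whether multicast group membership is being dropped by the switch.
IFACE=eth0                          # assumption: the NIC carrying cluster traffic
PCAP=/var/tmp/totem-$(hostname -s).pcap
FILTER='udp port 5405 or igmp'      # 5405 is the default totem mcastport
echo "tcpdump -i $IFACE -n -s 0 -w $PCAP $FILTER"
# Run the printed command as root on BOTH nodes, leave it running until the
# next fence, then compare the two captures around the time of the event.
```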
--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster