gfs2 hangs if a node crashes

William Seligman Tue, 13 Mar 2012 13:37:12 -0700

On 3/13/12 2:49 PM, emmanuel segura wrote:
> Sorry William
> 
> But I think clvmd must be used with
> 
> ocf:lvm2:clvmd
> 
> example
> 
> crm configure
> primitive clvmd  ocf:lvm2:clvmd params daemon_timeout="30"
> 
> clone cln_clvmd clvmd
> 
> and remember that clvmd depends on dlm, so you should do the same for dlm


I don't have an ocf:lvm2:clvmd resource on my system. When I do a web search, it
looks like a resource found on SUSE systems, but not on RHEL distros.

Based on "Clusters From Scratch", I think that if I'm using cman, dlm is
started automatically. I see dlm_controld is running without my explicitly
starting it:

# ps aux | grep dlm_controld
root  2495  0.0  0.0 234688  7564 ?  Ssl  12:32   0:00 dlm_controld

I should have also mentioned that I can duplicate this problem outside
pacemaker. That is, I can start cman, clvmd, and gfs2 manually on both nodes,
cut off power on one node, and clustering fails on the other node. So I suspect
it's not a pacemaker resource problem.
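
For anyone trying to reproduce this by hand, here is a minimal sketch of how
the hang could be detected without tying up a shell: run the command under a
timeout and report whether it returned. The function name command_hangs and
the 30-second default are my own choices for illustration, not anything taken
from the clvmd init script.

```python
import subprocess


def command_hangs(argv, timeout_s=30.0):
    """Return True if argv is still running after timeout_s seconds."""
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child and waits for it before re-raising.
        return True
    return False
```

On the surviving node one would call command_hangs(["vgdisplay"]) as root
after cutting power to the other node. If the child were stuck in
uninterruptible sleep, the kill itself could block, so this is only a rough
probe.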

For a moment I thought I might not have used "-p lock_dlm" when I created my
GFS2 filesystems, but I think the output of "gfs2_edit -p sb ..." shows that I
did it correctly: <http://pastebin.com/ALQYpKAy>.

When I looked more carefully at my lvm.conf, I saw that I had a typo:

fallback_to_local_locking=4

I changed it to the correct value (according to
<https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>):

fallback_to_local_locking=0

Unfortunately this doesn't solve the problem.
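
A small check along these lines might have caught the typo earlier. It is
only a sketch: check_lvm_conf and the EXPECTED table are hypothetical names,
the value 0 for fallback_to_local_locking is the one mentioned above, and
locking_type = 3 (clustered locking) is my assumption about what a clvmd
setup wants.

```python
import re

# Assumptions: fallback_to_local_locking = 0 as discussed in this post;
# locking_type = 3 is assumed to be the right value for clustered locking.
EXPECTED = {"locking_type": "3", "fallback_to_local_locking": "0"}

# Matches "name = value" or "name=value" at the start of a line.
SETTING = re.compile(r"^\s*(\w+)\s*=\s*(\S+)")


def check_lvm_conf(text, expected=EXPECTED):
    """Return {name: (found, wanted)} for settings that differ from expected."""
    found = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        m = SETTING.match(line)
        if m and m.group(1) in expected:
            found[m.group(1)] = m.group(2)
    return {
        name: (found.get(name), wanted)
        for name, wanted in expected.items()
        if found.get(name) != wanted
    }
```

Feeding it the contents of lvm.conf (for example
check_lvm_conf(open("/etc/lvm/lvm.conf").read()), assuming that path) returns
an empty dict when both settings match, and otherwise lists each mismatch as
(found, wanted), with found set to None when the setting is missing.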

So... any ideas?

> On 13 March 2012 17:29, William Seligman <selig...@nevis.columbia.edu>
> wrote:
> 
>> I'm not sure if this is a "Linux-HA" question; please direct me to the 
>> appropriate list if it's not.
>> 
>> I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in 
>> "Clusters From Scratch." Fencing is through forcibly rebooting a node by
>> cutting and restoring its power via UPS.
>> 
>> My fencing/failover tests have revealed a problem. If I gracefully turn off
>> one node ("crm node standby"; "service pacemaker stop"; "shutdown -r now")
>> all the resources transfer to the other node with no problems. If I cut
>> power to one node (as would happen if it were fenced), the lsb::clvmd
>> resource on the remaining node eventually fails. Since all the other
>> resources depend on clvmd, all the resources on the remaining node stop and
>> the cluster is left with nothing running.
>>
>> I've traced why the lsb::clvmd resource fails: the monitor/status command
>> includes "vgdisplay", which hangs indefinitely, so the monitor always
>> times out.
>>
>> So this isn't a problem with pacemaker, but with clvmd/dlm: If a node is 
>> cut off, the cluster isn't handling it properly. Has anyone on this list
>> seen this before? Any ideas?
>>
>> Details:
>>
>> versions:
>> Redhat Linux 6.2 (kernel 2.6.32)
>> cman-3.0.12.1
>> corosync-1.4.1
>> pacemaker-1.1.6
>> lvm2-2.02.87
>> lvm2-cluster-2.02.87
>>
>> cluster.conf: <http://pastebin.com/w5XNYyAX>
>> output of "crm configure show": <http://pastebin.com/atVkXjkn>
>> output of "lvm dumpconfig": <http://pastebin.com/rtw8c3Pf>
>>
>> /var/log/cluster/dlm_controld.log and /var/log/cluster/gfs_controld.log 
>> show nothing. When I shut down power to one node (orestes-tb), the output
>> of grep -E "(dlm|gfs2|clvmd)" /var/log/messages is 
>> <http://pastebin.com/vjpvCFeN>.
-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems