On 3/15/12 4:57 PM, emmanuel segura wrote:
> we can try to understand what happens when clvmd hangs
>
> edit the /etc/lvm/lvm.conf and change level = 7 in the log section, and
> uncomment this line:
>
> file = "/var/log/lvm2.log"
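
If I've understood the suggestion correctly, the change goes in the log
section of /etc/lvm/lvm.conf, roughly like this (only a sketch based on the
stock EL6 lvm.conf; the surrounding defaults are omitted and may differ by
version):

    log {
        # Uncomment to write LVM debug output to a file:
        file = "/var/log/lvm2.log"

        # Verbosity of the file log; 7 is the most verbose (debug):
        level = 7
    }
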
Here's the tail end of the file (the original is 1.6M). Because there are no
timestamps in the log, it's hard for me to point you to where I crashed the
other system. I think (though I'm not sure) that the crash happened after the
last occurrence of:

  cache/lvmcache.c:1484   Wiping internal VG cache

Honestly, it looks like a wall of text to me. Does it suggest anything to you?

> On 15 March 2012 20:50, William Seligman <selig...@nevis.columbia.edu>
> wrote:
>
>> On 3/15/12 12:55 PM, emmanuel segura wrote:
>>
>>> I don't see any error, and the answer to your question is yes
>>>
>>> can you show me your /etc/cluster/cluster.conf and your crm configure
>>> show
>>>
>>> that way, later I can try to see if I can find a fix
>>
>> Thanks for taking a look.
>>
>> My cluster.conf: <http://pastebin.com/w5XNYyAX>
>> crm configure show: <http://pastebin.com/atVkXjkn>
>>
>> Before you spend a lot of time on the second file, remember that clvmd
>> will hang whether or not I'm running pacemaker.
>>
>>> On 15 March 2012 17:42, William Seligman <selig...@nevis.columbia.edu>
>>> wrote:
>>>
>>>> On 3/15/12 12:15 PM, emmanuel segura wrote:
>>>>
>>>>> How did you create your volume group?
>>>>
>>>> pvcreate /dev/drbd0
>>>> vgcreate -c y ADMIN /dev/drbd0
>>>> lvcreate -L 200G -n usr ADMIN   # ... and so on
>>>> # "Nevis-HA" is the cluster name I used in cluster.conf
>>>> mkfs.gfs2 -p lock_dlm -j 2 -t Nevis_HA:usr /dev/ADMIN/usr   # ... and so on
>>>>
>>>>> give me the output of the vgs command when the cluster is up
>>>>
>>>> Here it is:
>>>>
>>>>   Logging initialised at Thu Mar 15 12:40:39 2012
>>>>   Set umask from 0022 to 0077
>>>>   Finding all volume groups
>>>>   Finding volume group "ROOT"
>>>>   Finding volume group "ADMIN"
>>>>   VG    #PV #LV #SN Attr   VSize   VFree
>>>>   ADMIN   1   5   0 wz--nc   2.61t 765.79g
>>>>   ROOT    1   2   0 wz--n- 117.16g       0
>>>>   Wiping internal VG cache
>>>>
>>>> I assume the "c" in the ADMIN attributes means that clustering is
>>>> turned on?
>>>>
>>>>> On 15 March 2012 17:06, William Seligman <selig...@nevis.columbia.edu>
>>>>> wrote:
>>>>>
>>>>>> On 3/15/12 11:50 AM, emmanuel segura wrote:
>>>>>>> yes william
>>>>>>>
>>>>>>> Now try clvmd -d and see what happens
>>>>>>>
>>>>>>> locking_type = 3 is the LVM cluster locking type
>>>>>>
>>>>>> Since you asked for confirmation, here it is: the output of 'clvmd -d'
>>>>>> just now: <http://pastebin.com/bne8piEw>. I crashed the other node at
>>>>>> Mar 15 12:02:35, which is when you see the only additional line of
>>>>>> output.
>>>>>>
>>>>>> I don't see any particular difference between this and the previous
>>>>>> result <http://pastebin.com/sWjaxAEF>, which suggests that I had
>>>>>> cluster locking enabled before, and still do now.
>>>>>>
>>>>>>> On 15 March 2012 16:15, William Seligman <selig...@nevis.columbia.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On 3/15/12 5:18 AM, emmanuel segura wrote:
>>>>>>>>
>>>>>>>>> The first thing I saw in your clvmd log is this:
>>>>>>>>>
>>>>>>>>> =============================================
>>>>>>>>> WARNING: Locking disabled. Be careful! This could corrupt your metadata.
>>>>>>>>> =============================================
>>>>>>>>
>>>>>>>> I saw that too, and thought the same as you did.
>>>>>>>> I did some checks (see below), but some web searches suggest that
>>>>>>>> this message is a normal consequence of clvmd initialization; e.g.,
>>>>>>>>
>>>>>>>> <http://markmail.org/message/vmy53pcv52wu7ghx>
>>>>>>>>
>>>>>>>>> use this command
>>>>>>>>>
>>>>>>>>> lvmconf --enable-cluster
>>>>>>>>>
>>>>>>>>> and remember for cman+pacemaker you don't need qdisk
>>>>>>>>
>>>>>>>> Before I tried your lvmconf suggestion, here was my /etc/lvm/lvm.conf:
>>>>>>>> <http://pastebin.com/841VZRzW> and the output of "lvm dumpconfig":
>>>>>>>> <http://pastebin.com/rtw8c3Pf>.
>>>>>>>>
>>>>>>>> Then I did as you suggested, but with a check to see if anything
>>>>>>>> changed:
>>>>>>>>
>>>>>>>> # cd /etc/lvm/
>>>>>>>> # cp lvm.conf lvm.conf.cluster
>>>>>>>> # lvmconf --enable-cluster
>>>>>>>> # diff lvm.conf lvm.conf.cluster
>>>>>>>> #
>>>>>>>>
>>>>>>>> So the key lines have been there all along:
>>>>>>>>   locking_type = 3
>>>>>>>>   fallback_to_local_locking = 0
>>>>>>>>
>>>>>>>>> On 14 March 2012 23:17, William Seligman <selig...@nevis.columbia.edu>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> On 3/14/12 9:20 AM, emmanuel segura wrote:
>>>>>>>>>>> Hello William
>>>>>>>>>>>
>>>>>>>>>>> I didn't know you are using drbd, and I don't know what type of
>>>>>>>>>>> configuration you are using
>>>>>>>>>>>
>>>>>>>>>>> But it's better you try to start clvmd with clvmd -d
>>>>>>>>>>>
>>>>>>>>>>> that way we can see what the problem is
>>>>>>>>>>
>>>>>>>>>> For what it's worth, here's the output of running clvmd -d on
>>>>>>>>>> the node that stays up: <http://pastebin.com/sWjaxAEF>
>>>>>>>>>>
>>>>>>>>>> What's probably important in that big mass of output are the
>>>>>>>>>> last two lines. Up to that point, I have both nodes up and
>>>>>>>>>> running cman + clvmd; cluster.conf is here:
>>>>>>>>>> <http://pastebin.com/w5XNYyAX>
>>>>>>>>>>
>>>>>>>>>> At the time of the next-to-last line, I cut power to the
>>>>>>>>>> other node.
>>>>>>>>>>
>>>>>>>>>> At the time of the last line, I ran "vgdisplay" on the
>>>>>>>>>> remaining node, which hangs forever.
>>>>>>>>>>
>>>>>>>>>> After a lot of web searching, I found that I'm not the only one
>>>>>>>>>> with this problem. Here's one case that doesn't seem relevant
>>>>>>>>>> to me, since I don't use qdisk:
>>>>>>>>>> <http://www.redhat.com/archives/linux-cluster/2007-October/msg00212.html>.
>>>>>>>>>> Here's one with the same problem on the same OS:
>>>>>>>>>> <http://bugs.centos.org/view.php?id=5229>, but with no resolution.
>>>>>>>>>>
>>>>>>>>>> Out of curiosity, has anyone on this list made a two-node
>>>>>>>>>> cman+clvmd cluster work for them?
>>>>>>>>>>
>>>>>>>>>>> On 14 March 2012 14:02, William Seligman <selig...@nevis.columbia.edu>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 3/14/12 6:02 AM, emmanuel segura wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think it's better you make clvmd start at boot
>>>>>>>>>>>>>
>>>>>>>>>>>>> chkconfig cman on ; chkconfig clvmd on
>>>>>>>>>>>>
>>>>>>>>>>>> I've already tried it. It doesn't work. The problem is that
>>>>>>>>>>>> my LVM information is on the drbd. If I start up clvmd
>>>>>>>>>>>> before drbd, it won't find the logical volumes.
>>>>>>>>>>>>
>>>>>>>>>>>> I also don't see why that would make a difference (although
>>>>>>>>>>>> this could be part of the confusion): a service is a
>>>>>>>>>>>> service. I've tried starting up clvmd inside and outside
>>>>>>>>>>>> pacemaker control, with the same problem.
>>>>>>>>>>>> Why would starting clvmd at boot make a difference?
>>>>>>>>>>>>
>>>>>>>>>>>>> On 13 March 2012 23:29, William Seligman <selig...@nevis.columbia.edu>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 3/13/12 5:50 PM, emmanuel segura wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So if you are using cman, why do you use lsb::clvmd?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think you are very confused
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't dispute that I may be very confused!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, from what I can tell, I still need to run
>>>>>>>>>>>>>> clvmd even if I'm running cman (I'm not using
>>>>>>>>>>>>>> rgmanager). If I just run cman, gfs2 and any other form
>>>>>>>>>>>>>> of mount fails. If I run cman, then clvmd, then gfs2,
>>>>>>>>>>>>>> everything behaves normally.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Going by these instructions:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> the resources he puts under "cluster control"
>>>>>>>>>>>>>> (rgmanager) I have to put under pacemaker control.
>>>>>>>>>>>>>> Those include drbd, clvmd, and gfs2.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The difference between what I've got and what's in
>>>>>>>>>>>>>> "Clusters From Scratch" is that in CFS they assign one
>>>>>>>>>>>>>> DRBD volume to a single filesystem. I create an LVM
>>>>>>>>>>>>>> physical volume on my DRBD resource, as in the above
>>>>>>>>>>>>>> tutorial, and so I have to start clvmd or the logical
>>>>>>>>>>>>>> volumes in the DRBD partition won't be recognized.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is there some way to get logical volumes recognized
>>>>>>>>>>>>>> automatically by cman without rgmanager that I've missed?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 13 March 2012 22:42, William Seligman <selig...@nevis.columbia.edu>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 3/13/12 12:29 PM, William Seligman wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm not sure if this is a "Linux-HA" question;
>>>>>>>>>>>>>>>>> please direct me to the appropriate list if it's
>>>>>>>>>>>>>>>>> not.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm setting up a two-node cman+pacemaker+gfs2
>>>>>>>>>>>>>>>>> cluster as described in "Clusters From Scratch."
>>>>>>>>>>>>>>>>> Fencing is through forcibly rebooting a node by
>>>>>>>>>>>>>>>>> cutting and restoring its power via UPS.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> My fencing/failover tests have revealed a
>>>>>>>>>>>>>>>>> problem. If I gracefully turn off one node ("crm
>>>>>>>>>>>>>>>>> node standby"; "service pacemaker stop";
>>>>>>>>>>>>>>>>> "shutdown -r now"), all the resources transfer to
>>>>>>>>>>>>>>>>> the other node with no problems. If I cut power
>>>>>>>>>>>>>>>>> to one node (as would happen if it were fenced),
>>>>>>>>>>>>>>>>> the lsb::clvmd resource on the remaining node
>>>>>>>>>>>>>>>>> eventually fails. Since all the other resources
>>>>>>>>>>>>>>>>> depend on clvmd, all the resources on the
>>>>>>>>>>>>>>>>> remaining node stop and the cluster is left with
>>>>>>>>>>>>>>>>> nothing running.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I've traced why the lsb::clvmd resource fails: the
>>>>>>>>>>>>>>>>> monitor/status command includes "vgdisplay",
>>>>>>>>>>>>>>>>> which hangs indefinitely. Therefore the monitor
>>>>>>>>>>>>>>>>> will always time out.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> So this isn't a problem with pacemaker, but with
>>>>>>>>>>>>>>>>> clvmd/dlm: if a node is cut off, the cluster
>>>>>>>>>>>>>>>>> isn't handling it properly. Has anyone on this
>>>>>>>>>>>>>>>>> list seen this before? Any ideas?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Details:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> versions:
>>>>>>>>>>>>>>>>>   Redhat Linux 6.2 (kernel 2.6.32)
>>>>>>>>>>>>>>>>>   cman-3.0.12.1
>>>>>>>>>>>>>>>>>   corosync-1.4.1
>>>>>>>>>>>>>>>>>   pacemaker-1.1.6
>>>>>>>>>>>>>>>>>   lvm2-2.02.87
>>>>>>>>>>>>>>>>>   lvm2-cluster-2.02.87
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> This may be a Linux-HA question after all!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I ran a few more tests. Here's the output from a
>>>>>>>>>>>>>>>> typical test of
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   grep -E "(dlm|gfs2|clvmd|fenc|syslogd)" /var/log/messages
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <http://pastebin.com/uqC6bc1b>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It looks like what's happening is that the fence
>>>>>>>>>>>>>>>> agent (one I wrote) is not returning the proper
>>>>>>>>>>>>>>>> error code when a node crashes. According to this
>>>>>>>>>>>>>>>> page, if a fencing agent fails, GFS2 will freeze to
>>>>>>>>>>>>>>>> protect the data:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Global_File_System_2/s1-gfs2hand-allnodes.html>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As a test, I tried to fence my test node via
>>>>>>>>>>>>>>>> standard means:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   stonith_admin -F \
>>>>>>>>>>>>>>>>     orestes-corosync.nevis.columbia.edu
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> These were the log messages, which show that
>>>>>>>>>>>>>>>> stonith_admin did its job and CMAN was notified of
>>>>>>>>>>>>>>>> the fencing: <http://pastebin.com/jaH820Bv>.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Unfortunately, I still got the gfs2 freeze, so this
>>>>>>>>>>>>>>>> is not the complete story.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> First things first. I vaguely recall a web page
>>>>>>>>>>>>>>>> that went over the STONITH return codes, but I
>>>>>>>>>>>>>>>> can't locate it again. Is there any reference to
>>>>>>>>>>>>>>>> the return codes expected from a fencing agent,
>>>>>>>>>>>>>>>> perhaps as a function of the state of the fencing
>>>>>>>>>>>>>>>> device?

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
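
P.S. In case it helps anyone searching the archives later: as far as I can
tell, an agent invoked by fenced receives its options as name=value pairs on
standard input and is expected to exit 0 only when the fencing action really
succeeded, and non-zero otherwise. The skeleton below is only a sketch of
that calling convention, not my actual UPS agent; the option names (action,
nodename) are the common ones but may not match every setup, and
cycle_ups_power is a hypothetical helper standing in for the real UPS logic.

    #!/bin/bash
    # Sketch of a fence agent's calling convention (not a working agent).
    # fenced passes options on stdin, e.g.:
    #   action=reboot
    #   nodename=somenode.example.edu

    action="reboot"
    nodename=""

    # Read name=value pairs from stdin.
    while read -r line; do
        case "$line" in
            action=*)   action="${line#action=}" ;;
            nodename=*) nodename="${line#nodename=}" ;;
        esac
    done

    case "$action" in
        reboot|off|on)
            # Talk to the UPS here; report success only if the outlet
            # really changed state (cycle_ups_power is hypothetical).
            if cycle_ups_power "$nodename"; then
                exit 0    # success: the node is known to be fenced
            else
                exit 1    # failure: GFS2/DLM stay blocked until fencing succeeds
            fi
            ;;
        status|monitor)
            exit 0        # fencing device is reachable
            ;;
        *)
            exit 1
            ;;
    esac
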
_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems