William :-) So your cluster is OK now?
On 27 March 2012 at 00:33, William Seligman <selig...@nevis.columbia.edu> wrote:

> On 3/26/12 5:31 PM, William Seligman wrote:
> > On 3/26/12 5:17 PM, William Seligman wrote:
> >> On 3/26/12 4:28 PM, emmanuel segura wrote:
> >>> and I suggest you start clvmd at boot time:
> >>>
> >>> chkconfig clvmd on
> >>
> >> I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:
> >>
> >> Mounting GFS2 filesystem (/usr/nevis): invalid device path "/dev/mapper/ADMIN-usr"
> >> [FAILED]
> >>
> >> ... and so on, because the ADMIN volume group was never loaded by clvmd.
> >> Without a "vgscan" in there somewhere, the system can't see the volume
> >> groups on the drbd resource.
> >
> > Wait a second... there's an ocf:heartbeat:LVM resource! Testing...
>
> Emmanuel, you did it!
>
> For the sake of future searches, and possibly future documentation, let me
> start with my original description of the problem:
>
> > I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in
> > "Clusters From Scratch." Fencing is done by forcibly rebooting a node,
> > cutting and restoring its power via UPS.
> >
> > My fencing/failover tests have revealed a problem. If I gracefully turn
> > off one node ("crm node standby"; "service pacemaker stop"; "shutdown -r
> > now"), all the resources transfer to the other node with no problems. If
> > I cut power to one node (as would happen if it were fenced), the
> > lsb::clvmd resource on the remaining node eventually fails. Since all the
> > other resources depend on clvmd, all the resources on the remaining node
> > stop and the cluster is left with nothing running.
> >
> > I've traced why the lsb::clvmd resource fails: its monitor/status command
> > includes "vgdisplay", which hangs indefinitely, so the monitor will
> > always time out.
> >
> > So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is
> > cut off, the cluster isn't handling it properly. Has anyone on this list
> > seen this before? Any ideas?
> >
> > Details:
> >
> > Versions:
> > Red Hat Enterprise Linux 6.2 (kernel 2.6.32)
> > cman-3.0.12.1
> > corosync-1.4.1
> > pacemaker-1.1.6
> > lvm2-2.02.87
> > lvm2-cluster-2.02.87
>
> The problem is that clvmd on the surviving node will hang if there's a
> substantial period of time during which the other node is running cman but
> not clvmd. I never tracked down why this happens, but there's a practical
> solution: minimize any interval for which that would be true. To ensure
> this, take clvmd outside the resource manager's control:
>
> chkconfig cman on
> chkconfig clvmd on
> chkconfig pacemaker on
>
> On RHEL 6.2, these services will be started in the above order; clvmd will
> start within a few seconds after cman.
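
A quick way to confirm that chkconfig produced the right start order is to
list the runlevel 3 start links; lower S-numbers start first. The exact
S-numbers below are an assumption based on stock RHEL 6 init scripts and may
differ on your system; what matters is that cman sorts before clvmd, and
clvmd before pacemaker:

  # List the boot-time start links for the three services in runlevel 3.
  ls /etc/rc3.d/ | grep -E 'cman|clvmd|pacemaker'
  # Sample output (S-numbers are illustrative, not guaranteed):
  #   S21cman
  #   S24clvmd
  #   S99pacemaker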
> Here's my cluster.conf <http://pastebin.com/GUr0CEgZ> and the output of
> "crm configure show" <http://pastebin.com/f9D4Ui5Z>. The key lines from
> the latter are:
>
> primitive AdminDrbd ocf:linbit:drbd \
>         params drbd_resource="admin"
> primitive AdminLvm ocf:heartbeat:LVM \
>         params volgrpname="ADMIN" \
>         op monitor interval="30" timeout="100" depth="0"
> primitive Gfs2 lsb:gfs2
> group VolumeGroup AdminLvm Gfs2
> ms AdminClone AdminDrbd \
>         meta master-max="2" master-node-max="1" \
>         clone-max="2" clone-node-max="1" \
>         notify="true" interleave="true"
> clone VolumeClone VolumeGroup \
>         meta interleave="true"
> colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
> order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start
>
> What I learned: if you extend the example in "Clusters From Scratch" to
> include logical volumes, you must start clvmd at boot time and put any
> volume groups in ocf:heartbeat:LVM resources that start before gfs2.
>
> Note the long timeout on the ocf:heartbeat:LVM resource. This is a good
> idea because, while the crashed node boots, there will still be an interval
> of a few seconds when cman is running but clvmd isn't. During my tests, the
> LVM monitor would fail if it checked during that interval with a timeout
> shorter than the time clvmd took to start on the crashed node. This was
> annoying: all resources dependent on AdminLvm would be stopped until
> AdminLvm recovered (a few seconds later). Increasing the timeout avoids
> this.
>
> It also means that during any recovery procedure on the crashed node for
> which I've turned the services off at boot, I have to minimize the interval
> between starting cman and starting clvmd; e.g.:
>
> service drbd start                        # ... and fix any split-brain problems or whatever
> service cman start; service clvmd start   # put on one line
> service pacemaker start
>
> I thank everyone on this list who was patient with me as I pounded on this
> problem for two weeks!
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

--
this is my life and I live it for as long as God wills
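
For future searchers who hit the original symptom (vgdisplay hanging on the
surviving node after its peer is cut off), the hang can be confirmed from the
dlm side before touching LVM at all. A minimal sketch, assuming the cman/dlm
userland tools (group_tool, dlm_tool) that ship with RHEL 6.2:

  # On the surviving node, inspect the cman groups and dlm lockspaces;
  # a fence domain or clvmd lockspace stuck in recovery explains the hang.
  group_tool ls
  dlm_tool ls
  # While the clvmd lockspace is blocked, any LVM command that talks to
  # clvmd hangs rather than failing:
  vgdisplay ADMIN        # hangs until dlm recovery completes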