William :-) So your cluster is OK now?
On 27 March 2012 at 00:33, William Seligman <selig...@nevis.columbia.edu> wrote:

> On 3/26/12 5:31 PM, William Seligman wrote:
> > On 3/26/12 5:17 PM, William Seligman wrote:
> >> On 3/26/12 4:28 PM, emmanuel segura wrote:
> >>> and I suggest you start clvmd at boot time:
> >>>
> >>> chkconfig clvmd on
> >>
> >> I'm afraid this doesn't work. It's as I predicted; when gfs2 starts I get:
> >>
> >> Mounting GFS2 filesystem (/usr/nevis): invalid device path "/dev/mapper/ADMIN-usr"
> >> [FAILED]
> >>
> >> ... and so on, because the ADMIN volume group was never loaded by clvmd.
> >> Without a "vgscan" in there somewhere, the system can't see the volume
> >> groups on the drbd resource.
> >
> > Wait a second... there's an ocf:heartbeat:LVM resource! Testing...
>
> Emmanuel, you did it!
>
> For the sake of future searches, and possibly future documentation, let me
> start with my original description of the problem:
>
> > I'm setting up a two-node cman+pacemaker+gfs2 cluster as described in
> > "Clusters From Scratch." Fencing is done by forcibly rebooting a node,
> > cutting and restoring its power via UPS.
> >
> > My fencing/failover tests have revealed a problem. If I gracefully turn
> > off one node ("crm node standby"; "service pacemaker stop"; "shutdown -r
> > now"), all the resources transfer to the other node with no problems. If
> > I cut power to one node (as would happen if it were fenced), the
> > lsb::clvmd resource on the remaining node eventually fails. Since all the
> > other resources depend on clvmd, all the resources on the remaining node
> > stop and the cluster is left with nothing running.
> >
> > I've traced why the lsb::clvmd resource fails: its monitor/status command
> > includes "vgdisplay", which hangs indefinitely, so the monitor will
> > always time out.
> >
> > So this isn't a problem with pacemaker, but with clvmd/dlm: if a node is
> > cut off, the cluster isn't handling it properly. Has anyone on this list
> > seen this before? Any ideas?
> >
> > Details:
> >
> > Versions:
> > Red Hat Enterprise Linux 6.2 (kernel 2.6.32)
> > cman-3.0.12.1
> > corosync-1.4.1
> > pacemaker-1.1.6
> > lvm2-2.02.87
> > lvm2-cluster-2.02.87
>
> The problem is that clvmd on the surviving node will hang if there's a
> substantial period of time during which the other node is running cman but
> not clvmd. I never tracked down why this happens, but there's a practical
> solution: minimize any interval for which that would be true. To ensure
> this, take clvmd outside the resource manager's control:
>
> chkconfig cman on
> chkconfig clvmd on
> chkconfig pacemaker on
>
> On RHEL 6.2, these services will be started in the above order; clvmd will
> start within a few seconds after cman.
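
A quick way to confirm that chkconfig produced the right start order is to
list the runlevel 3 start links; lower S-numbers start first. The exact
S-numbers below are an assumption based on stock RHEL 6 init scripts and may
differ on your system; what matters is that cman sorts before clvmd, and
clvmd before pacemaker:

  # List the boot-time start links for the three services in runlevel 3.
  ls /etc/rc3.d/ | grep -E 'cman|clvmd|pacemaker'
  # Sample output (S-numbers are illustrative, not guaranteed):
  #   S21cman
  #   S24clvmd
  #   S99pacemaker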
> Here's my cluster.conf <http://pastebin.com/GUr0CEgZ> and the output of
> "crm configure show" <http://pastebin.com/f9D4Ui5Z>. The key lines from
> the latter are:
>
> primitive AdminDrbd ocf:linbit:drbd \
>         params drbd_resource="admin"
> primitive AdminLvm ocf:heartbeat:LVM \
>         params volgrpname="ADMIN" \
>         op monitor interval="30" timeout="100" depth="0"
> primitive Gfs2 lsb:gfs2
> group VolumeGroup AdminLvm Gfs2
> ms AdminClone AdminDrbd \
>         meta master-max="2" master-node-max="1" \
>         clone-max="2" clone-node-max="1" \
>         notify="true" interleave="true"
> clone VolumeClone VolumeGroup \
>         meta interleave="true"
> colocation Volume_With_Admin inf: VolumeClone AdminClone:Master
> order Admin_Before_Volume inf: AdminClone:promote VolumeClone:start
>
> What I learned: if you extend the example in "Clusters From Scratch" to
> include logical volumes, you must start clvmd at boot time and put any
> volume groups in ocf:heartbeat:LVM resources that start before gfs2.
>
> Note the long timeout on the ocf:heartbeat:LVM resource. This is a good
> idea because, while the crashed node boots, there will still be an interval
> of a few seconds when cman is running but clvmd isn't. During my tests, the
> LVM monitor would fail if it checked during that interval with a timeout
> shorter than the time clvmd took to start on the crashed node. This was
> annoying: all resources dependent on AdminLvm would be stopped until
> AdminLvm recovered (a few seconds later). Increasing the timeout avoids
> this.
>
> It also means that during any recovery procedure on the crashed node for
> which I've turned the services off at boot, I have to minimize the interval
> between starting cman and starting clvmd; e.g.:
>
> service drbd start                        # ... and fix any split-brain problems or whatever
> service cman start; service clvmd start   # put on one line
> service pacemaker start
>
> I thank everyone on this list who was patient with me as I pounded on this
> problem for two weeks!
> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

--
this is my life and I live it for as long as God wills
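
For future searchers who hit the original symptom (vgdisplay hanging on the
surviving node after its peer is cut off), the hang can be confirmed from the
dlm side before touching LVM at all. A minimal sketch, assuming the cman/dlm
userland tools (group_tool, dlm_tool) that ship with RHEL 6.2:

  # On the surviving node, inspect the cman groups and dlm lockspaces;
  # a fence domain or clvmd lockspace stuck in recovery explains the hang.
  group_tool ls
  dlm_tool ls
  # While the clvmd lockspace is blocked, any LVM command that talks to
  # clvmd hangs rather than failing:
  vgdisplay ADMIN        # hangs until dlm recovery completes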