I think it's better to run clvmd with cman; I don't know why you are using the LSB script for clvmd.
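For example, the usual cman arrangement on Red Hat looks roughly like this (a sketch only; service and package names assume RHEL 6 with the lvm2-cluster package, adjust to your distribution):

    # enable clustered LVM locking (sets locking_type = 3 in lvm.conf)
    lvmconf --enable-cluster
    # bring up the membership/fencing layer first, then clvmd from its own init script
    service cman start
    service clvmd start
    chkconfig cman on
    chkconfig clvmd on

With that, pacemaker only has to manage whatever sits on top of the clustered VG, instead of starting clvmd itself through an LSB resource.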
On Red Hat, clvmd needs cman, and you are trying to run it under pacemaker. I am not sure this is the problem, but this type of configuration is strange; I built a virtual cluster with KVM and did not find any problems.

On 24 March 2012 at 13:09, William Seligman <selig...@nevis.columbia.edu> wrote:

> On 3/24/12 4:47 AM, emmanuel segura wrote:
>> How do you configure clvmd?
>> With cman or with pacemaker?
>
> Pacemaker. Here's the output of 'crm configure show':
> <http://pastebin.com/426CdVwN>
>
>> On 23 March 2012 at 22:14, William Seligman <selig...@nevis.columbia.edu> wrote:
>>> On 3/23/12 5:03 PM, emmanuel segura wrote:
>>>> Sorry, but I would like to know if you can show me your /etc/cluster/cluster.conf.
>>>
>>> Here it is: <http://pastebin.com/GUr0CEgZ>
>>>
>>>> On 23 March 2012 at 21:50, William Seligman <selig...@nevis.columbia.edu> wrote:
>>>>> On 3/22/12 2:43 PM, William Seligman wrote:
>>>>>> On 3/20/12 4:55 PM, Lars Ellenberg wrote:
>>>>>>> On Fri, Mar 16, 2012 at 05:06:04PM -0400, William Seligman wrote:
>>>>>>>> On 3/16/12 12:12 PM, William Seligman wrote:
>>>>>>>>> On 3/16/12 7:02 AM, Andreas Kurz wrote:
>>>>>>>>>>
>>>>>>>>>> s----- ... DRBD suspended I/O, most likely because of its fencing policy. For valid dual-primary setups you have to use the "resource-and-stonith" policy and a working "fence-peer" handler. In this mode I/O is suspended until fencing of the peer is successful. The question is why the peer does _not_ also suspend its I/O, because obviously fencing was not successful ...
>>>>>>>>>>
>>>>>>>>>> So with a correct DRBD configuration one of your nodes should already have been fenced because of connection loss between nodes (on the drbd replication link).
>>>>>>>>>>
>>>>>>>>>> You can use e.g. that nice fencing script:
>>>>>>>>>>
>>>>>>>>>> http://goo.gl/O4N8f
>>>>>>>>>
>>>>>>>>> This is the output of "drbdadm dump admin": <http://pastebin.com/kTxvHCtx>
>>>>>>>>>
>>>>>>>>> So I've got resource-and-stonith. I gather from an earlier thread that obliterate-peer.sh is more-or-less equivalent in functionality with stonith_admin_fence_peer.sh:
>>>>>>>>>
>>>>>>>>> <http://www.gossamer-threads.com/lists/linuxha/users/78504#78504>
>>>>>>>>>
>>>>>>>>> At the moment I'm pursuing the possibility that I'm returning the wrong return codes from my fencing agent:
>>>>>>>>>
>>>>>>>>> <http://www.gossamer-threads.com/lists/linuxha/users/78572>
>>>>>>>>
>>>>>>>> I cleaned up my fencing agent, making sure its return code matched those returned by other agents in /usr/sbin/fence_, and allowing for some delay issues in reading the UPS status. But...
>>>>>>>>
>>>>>>>>> After that, I'll look at another suggestion with lvm.conf:
>>>>>>>>>
>>>>>>>>> <http://www.gossamer-threads.com/lists/linuxha/users/78796#78796>
>>>>>>>>>
>>>>>>>>> Then I'll try DRBD 8.4.1. Hopefully one of these is the source of the issue.
>>>>>>>>
>>>>>>>> Failure on all three counts.
>>>>>>>
>>>>>>> May I suggest you double check the permissions on your fence-peer script? I suspect you may simply have forgotten the "chmod +x".
>>>>>>>
>>>>>>> Test with "drbdadm fence-peer minor-0" from the command line.
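A quick way to check both of Lars's points by hand on the surviving node (the handler path below is only an example; use whatever the fence-peer line in your drbd.conf actually points to):

    # the script named by the fence-peer handler must be executable
    ls -l /usr/lib/drbd/stonith_admin-fence-peer.sh
    chmod +x /usr/lib/drbd/stonith_admin-fence-peer.sh
    # then invoke the handler the way DRBD itself would
    drbdadm fence-peer minor-0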
>>>>>> I still haven't solved the problem, but this advice has gotten me further than before.
>>>>>>
>>>>>> First, Lars was correct: I did not have execute permissions set on my fence-peer scripts. (D'oh!) I turned them on, but that did not change anything: cman+clvmd still hung on the vgdisplay command if I crashed the peer node.
>>>>>>
>>>>>> I started up both nodes again (cman+pacemaker+drbd+clvmd) and tried Lars' suggested command. I didn't save the response for this message (d'oh again!) but it said that the fence-peer script had failed.
>>>>>>
>>>>>> Hmm. The peer was definitely shutting down, so my fencing script is working. I went over it, comparing the return codes to those of the existing scripts, and made some changes. Here's my current script: <http://pastebin.com/nUnYVcBK>.
>>>>>>
>>>>>> Up until now my fence-peer scripts had either been Lon Hohberger's obliterate-peer.sh or Digimer's rhcs_fence. I decided to try the stonith_admin-fence-peer.sh that Andreas Kurz recommended; unlike the first two scripts, which fence using fence_node, the latter script just calls stonith_admin.
>>>>>>
>>>>>> When I tried the stonith_admin-fence-peer.sh script, it worked:
>>>>>>
>>>>>> # drbdadm fence-peer minor-0
>>>>>> stonith_admin-fence-peer.sh[10886]: stonith_admin successfully fenced peer orestes-corosync.nevis.columbia.edu.
>>>>>>
>>>>>> Power was cut on the peer, the remaining node stayed up. Then I brought up the peer with:
>>>>>>
>>>>>> stonith_admin -U orestes-corosync.nevis.columbia.edu
>>>>>>
>>>>>> BUT: when the restored peer came up and started to run cman, clvmd hung on the main node again.
>>>>>>
>>>>>> After cycling through some more tests, I found that if I brought down the peer with drbdadm, then brought up the peer with no HA services, then started drbd and then cman, the cluster remained intact.
>>>>>>
>>>>>> If I crashed the peer, the scheme in the previous paragraph didn't work. I bring up drbd, check that the disks are both UpToDate, then bring up cman. At that point the vgdisplay on the main node takes so long to run that clvmd will time out:
>>>>>>
>>>>>> vgdisplay
>>>>>> Error locking on node orestes-corosync.nevis.columbia.edu: Command timed out
>>>>>>
>>>>>> I timed how long it took vgdisplay to run. I might be able to work around this by setting the timeout on my clvmd resource to 300s, but that seems to be a band-aid for an underlying problem. Any suggestions on what else I could check?
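If the 300s band-aid is worth trying while the real cause is tracked down, something along these lines should do it (the resource name "clvmd" is an assumption; take the real name from 'crm configure show'):

    # inspect the current definition and operation timeouts of the clvmd primitive
    crm configure show clvmd
    # then raise the timeouts with "crm configure edit clvmd", e.g. adding
    #   op start timeout="300s" \
    #   op monitor interval="30s" timeout="300s"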
>>>>> I've done some more tests. Still no solution, just an observation. The "death mode" appears to be:
>>>>>
>>>>> - Two nodes running cman+pacemaker+drbd+clvmd.
>>>>> - Take one node down = one remaining node with cman+pacemaker+drbd+clvmd.
>>>>> - Start up the dead node. If it ever gets into a state in which it's running cman but not clvmd, clvmd on the uncrashed node hangs.
>>>>> - Conversely, if I bring up drbd, make it primary, and start cman+clvmd, there's no problem on the uncrashed node.
>>>>>
>>>>> My guess is that clvmd is getting the number of nodes it expects from cman. When the formerly-dead node starts running cman, the number of cluster nodes goes to 2 (I checked with 'cman_tool status'), but the number of nodes running clvmd is still 1, hence the crash.
>>>>>
>>>>> Does this guess make sense?
>
> --
> Bill Seligman             | mailto://selig...@nevis.columbia.edu
> Nevis Labs, Columbia Univ | http://www.nevis.columbia.edu/~seligman/
> PO Box 137                |
> Irvington NY 10533 USA    | Phone: (914) 591-2823
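One rough way to test that guess from the surviving node, while the rebooted peer is running cman but not yet clvmd (dlm_tool assumes the cluster's DLM utilities are installed):

    # what the membership layer thinks
    cman_tool status     # overall state, votes, node count
    cman_tool nodes      # per-node membership as cman sees it
    # what the lock manager thinks; the "clvmd" lockspace only shows up
    # once clvmd on that node has actually joined
    dlm_tool ls

If cman already counts two members but the clvmd lockspace still has only one, that would match the behaviour described above.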