Hi,

On Wed, Mar 07, 2012 at 07:52:16PM -0500, William Seligman wrote:
> On 3/5/12 11:55 AM, William Seligman wrote:
> > On 3/3/12 3:30 PM, William Seligman wrote:
> >> On 3/3/12 2:14 PM, Florian Haas wrote:
> >>> On Sat, Mar 3, 2012 at 6:55 PM, William Seligman
> >>> <selig...@nevis.columbia.edu> wrote:
> >>>> On 3/3/12 12:03 PM, emmanuel segura wrote:
> >>>>>
> >>>>> are you sure the exportfs agent can be used with an active/active clone?
> >>>>
> >>>> a) I've been through the script. If there's some problem associated with
> >>>> it being cloned, I haven't seen it. (It can't handle
> >>>> globally-unique="true", but I didn't turn that on.)
> >>>
> >>> It shouldn't have a problem with being cloned. Obviously, cloning that
> >>> RA _really_ makes sense only with the export that manages an NFSv4
> >>> virtual root (fsid=0). Otherwise, the export clone has to be hosted on
> >>> a clustered filesystem, and you'd have to have a pNFS implementation
> >>> that doesn't suck (tough to come by on Linux), and if you want that
> >>> sort of replicated, parallel-access NFS you might as well use Gluster.
> >>> The downside of the latter, though, is that it's currently NFSv3-only,
> >>> without sideband locking.
> >>
> >> I'll look this over when I have a chance. I think I can get away without
> >> an NFSv4 virtual root because I'm exporting everything to my cluster
> >> either read-only, or only one system at a time will do any writing. Now
> >> that you've warned me, I'll do some more checking.
> >>
> >>>> b) I had similar problems using the exportfs resource in a
> >>>> primary-secondary setup without clones.
> >>>>
> >>>> Why would a resource being cloned create an ordering problem? I haven't
> >>>> set the interleave parameter (even with the documentation I'm not sure
> >>>> what it does), but A before B before C seems pretty clear, even for
> >>>> cloned resources.
> >>>
> >>> As for what interleave does: suppose you have two clones, A and B,
> >>> and they're linked with an order constraint, like this:
> >>>
> >>> order A_before_B inf: A B
> >>>
> >>> ... then if interleave is false, _all_ instances of A must be started
> >>> before _any_ instance of B gets to start anywhere in the cluster.
> >>> However, if interleave is true, then for any node only the _local_
> >>> instance of A needs to be started before it can start the
> >>> corresponding _local_ instance of B.
> >>>
> >>> In other words, interleave=true is actually the reasonable thing to
> >>> set on all clone instances by default, and I believe the pengine
> >>> actually does use a default of interleave=true on defined clone sets
> >>> since some 1.1.x release (I don't recall which).
> >>
> >> Thanks, Florian. That's a great explanation. I'll probably stick
> >> "interleave=true" on most of my clones just to make sure.
> >>
> >> It explains an error message I've seen in the logs:
> >>
> >> Mar 2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh:
> >> Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not
> >> support the same number of resources per node
> >>
> >> Because ClusterIPClone has globally-unique=true and clone-max=2, it's
> >> possible for both instances to be running on a single node; I've seen
> >> this a few times in my testing when cycling power on one of the nodes.
> >> Interleaving doesn't make sense in such a case.
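
Just to make the interleave point concrete, here is a minimal crm shell
sketch. The resources and names below are made up for illustration (plain
Dummy agents), not taken from Bill's configuration:

  # illustration only -- hypothetical Dummy resources
  primitive p_A ocf:pacemaker:Dummy
  primitive p_B ocf:pacemaker:Dummy
  clone cl_A p_A meta interleave="true"
  clone cl_B p_B meta interleave="true"
  order o_A_before_B inf: cl_A cl_B

With interleave="true" on both clones, a node starts its copy of cl_B as
soon as its own copy of cl_A is running; with interleave="false" it would
have to wait for every instance of cl_A cluster-wide.
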
> >>
> >>> Bill, seeing as you've already pastebinned your config and crm_mon
> >>> output, could you also pastebin your whole CIB as per "cibadmin -Q"
> >>> output? Thanks.
> >>
> >> Sure: <http://pastebin.com/pjSJ79H6>. It doesn't have the exportfs
> >> resources in it; I took them out before leaving for the weekend. If it
> >> helps, I'll put them back in and try to get the "cibadmin -Q" output
> >> before any nodes crash.
> >>
> >
> > For a test, I stuck in an exportfs resource with all the ordering
> > constraints. Here's the "cibadmin -Q" output from that:
> >
> > <http://pastebin.com/nugdufJc>
> >
> > The output of crm_mon just after doing that, showing resource failure:
> >
> > <http://pastebin.com/cyCFGUSD>
> >
> > Then all the resources are stopped:
> >
> > <http://pastebin.com/D62sGSrj>
> >
> > A few seconds later one of the nodes is fenced, but this does not bring
> > up anything:
> >
> > <http://pastebin.com/wzbmfVas>
>
> I believe I have the solution to my stability problem. It doesn't solve
> the issue of ordering, but I think I have a configuration that will
> survive failover.
>
> Here's the problem. I had exportfs resources such as:
>
> primitive ExportUsrNevis ocf:heartbeat:exportfs \
>     op start interval="0" timeout="40" \
>     op stop interval="0" timeout="45" \
>     params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" \
>         fsid="20" options="ro,no_root_squash,async"
>
> I did detailed traces of the execution of exportfs (putting in logger
> commands) and found that the problem was in the backup_rmtab function in
> exportfs:
>
> backup_rmtab() {
>     local rmtab_backup
>     if [ ${OCF_RESKEY_rmtab_backup} != "none" ]; then
>         rmtab_backup="${OCF_RESKEY_directory}/${OCF_RESKEY_rmtab_backup}"
>         grep ":${OCF_RESKEY_directory}:" /var/lib/nfs/rmtab > ${rmtab_backup}
>     fi
> }
>
> The problem was that the grep command was taking a long time, longer than
> any timeout I'd assigned to the resource. I looked at /var/lib/nfs/rmtab,
> and saw it was 60GB on one of my nodes and 16GB on the other.

Oops.
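
For anyone wanting to check whether their own rmtab has gone the same way,
a rough check with standard tools (nothing cluster-specific, just the same
/var/lib/nfs/rmtab the agent reads) would be:

  ls -lh /var/lib/nfs/rmtab            # how big the file has grown
  wc -l /var/lib/nfs/rmtab             # total entries
  sort -u /var/lib/nfs/rmtab | wc -l   # unique entries

A large gap between the last two numbers means duplicated client entries
are accumulating, which is exactly what makes that grep in backup_rmtab()
overrun the stop timeout.
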
> Since backup_rmtab() is called during the "stop" action, the resource
> could never successfully stop; it would always time out. This led to the
> state shown in the pastebins above: no amount of pacemaker resource
> restarting or fencing could fix the problem.
>
> My fixes were:
>
> - "rm /var/lib/nfs/rmtab; touch /var/lib/nfs/rmtab" on both nodes
>
> - going to all the directories I'd exported and deleting any .rmtab files
>   I found
>
> - adding the parameter 'rmtab_backup="none"' to all my exportfs resources.
>   I believe I can get away with this since I think all my clients are
>   mounting via NFSv4 and using automount to do it. I'll run some tests
>   with cluster failover while clients are mounting to be sure.

I recall there was some discussion about the merits of doing the rmtab
backup.

> The next question is: how did /var/lib/nfs/rmtab get so big? When I
> looked at the file, I saw the same two clients listed over and over
> again; I used those clients to mount an exported partition as a test.
> Somehow, with cluster failures and restarts, backup_rmtab() and
> restore_rmtab() in exportfs got into a loop in which those client entries
> were accumulated and never deleted.
>
> Perhaps this could be prevented in restore_rmtab() by replacing the line
>
> cat ${rmtab_backup} >> /var/lib/nfs/rmtab

That's really disastrous. I wonder how that got in.

> with something like
>
> cat ${rmtab_backup} /var/lib/nfs/rmtab | sort -u | cat - >/var/lib/nfs/rmtab

Looks better.

> I'll leave it to the experts to determine whether that would work, or if
> it's really necessary.

Thanks for the analysis.

Cheers,

Dejan

> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems