Hi,

On Wed, Mar 07, 2012 at 07:52:16PM -0500, William Seligman wrote:
> On 3/5/12 11:55 AM, William Seligman wrote:
> > On 3/3/12 3:30 PM, William Seligman wrote:
> >> On 3/3/12 2:14 PM, Florian Haas wrote:
> >>> On Sat, Mar 3, 2012 at 6:55 PM, William Seligman
> >>> <selig...@nevis.columbia.edu> wrote:
> >>>> On 3/3/12 12:03 PM, emmanuel segura wrote:
> >>>>>
> >>>>> are you sure the exportfs agent can be used with an active/active clone?
> >>>>
> >>>> a) I've been through the script. If there's some problem associated with
> >>>> it being cloned, I haven't seen it. (It can't handle
> >>>> globally-unique="true", but I didn't turn that on.)
> >>>
> >>> It shouldn't have a problem with being cloned. Obviously, cloning that
> >>> RA _really_ makes sense only with the export that manages an NFSv4
> >>> virtual root (fsid=0). Otherwise, the export clone has to be hosted on
> >>> a clustered filesystem, and you'd have to have a pNFS implementation
> >>> that doesn't suck (tough to come by on Linux), and if you want that
> >>> sort of replicated, parallel-access NFS you might as well use Gluster.
> >>> The downside of the latter, though, is that it's currently NFSv3-only,
> >>> without sideband locking.
> >>
> >> I'll look this over when I have a chance. I think I can get away without
> >> an NFSv4 virtual root because I'm exporting everything to my cluster
> >> either read-only, or only one system at a time will do any writing. Now
> >> that you've warned me, I'll do some more checking.
> >>
> >>>> b) I had similar problems using the exportfs resource in a
> >>>> primary-secondary setup without clones.
> >>>>
> >>>> Why would a resource being cloned create an ordering problem? I haven't
> >>>> set the interleave parameter (even with the documentation I'm not sure
> >>>> what it does), but A before B before C seems pretty clear, even for
> >>>> cloned resources.
> >>>
> >>> As for what interleave does: suppose you have two clones, A and B,
> >>> and they're linked with an order constraint, like this:
> >>>
> >>> order A_before_B inf: A B
> >>>
> >>> ... then if interleave is false, _all_ instances of A must be started
> >>> before _any_ instance of B gets to start anywhere in the cluster.
> >>> However, if interleave is true, then for any node only the _local_
> >>> instance of A needs to be started before it can start the
> >>> corresponding _local_ instance of B.
> >>>
> >>> In other words, interleave=true is actually the reasonable thing to
> >>> set on all clone instances by default, and I believe the pengine
> >>> actually does use a default of interleave=true on defined clone sets
> >>> since some 1.1.x release (I don't recall which).
> >>
> >> Thanks, Florian. That's a great explanation. I'll probably stick
> >> "interleave=true" on most of my clones just to make sure.
> >>
> >> It explains an error message I've seen in the logs:
> >>
> >> Mar 2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh:
> >> Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not
> >> support the same number of resources per node
> >>
> >> Because ClusterIPClone has globally-unique=true and clone-max=2, it's
> >> possible for both instances to be running on a single node; I've seen
> >> this a few times in my testing when cycling power on one of the nodes.
> >> Interleaving doesn't make sense in such a case.
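
Just to make the interleave point concrete, here is a minimal crm shell
sketch. The resources and names below are made up for illustration (plain
Dummy agents), not taken from Bill's configuration:

  # illustration only -- hypothetical Dummy resources
  primitive p_A ocf:pacemaker:Dummy
  primitive p_B ocf:pacemaker:Dummy
  clone cl_A p_A meta interleave="true"
  clone cl_B p_B meta interleave="true"
  order o_A_before_B inf: cl_A cl_B

With interleave="true" on both clones, a node starts its copy of cl_B as
soon as its own copy of cl_A is running; with interleave="false" it would
have to wait for every instance of cl_A cluster-wide.
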
> >>
> >>> Bill, seeing as you've already pastebinned your config and crm_mon
> >>> output, could you also pastebin your whole CIB as per "cibadmin -Q"
> >>> output? Thanks.
> >>
> >> Sure: <http://pastebin.com/pjSJ79H6>. It doesn't have the exportfs
> >> resources in it; I took them out before leaving for the weekend. If it
> >> helps, I'll put them back in and try to get the "cibadmin -Q" output
> >> before any nodes crash.
> >>
> >
> > For a test, I stuck in an exportfs resource with all the ordering
> > constraints. Here's the "cibadmin -Q" output from that:
> >
> > <http://pastebin.com/nugdufJc>
> >
> > The output of crm_mon just after doing that, showing resource failure:
> >
> > <http://pastebin.com/cyCFGUSD>
> >
> > Then all the resources are stopped:
> >
> > <http://pastebin.com/D62sGSrj>
> >
> > A few seconds later one of the nodes is fenced, but this does not bring
> > up anything:
> >
> > <http://pastebin.com/wzbmfVas>
>
> I believe I have the solution to my stability problem. It doesn't solve
> the issue of ordering, but I think I have a configuration that will
> survive failover.
>
> Here's the problem. I had exportfs resources such as:
>
> primitive ExportUsrNevis ocf:heartbeat:exportfs \
>     op start interval="0" timeout="40" \
>     op stop interval="0" timeout="45" \
>     params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" \
>         fsid="20" options="ro,no_root_squash,async"
>
> I did detailed traces of the execution of exportfs (putting in logger
> commands) and found that the problem was in the backup_rmtab function in
> exportfs:
>
> backup_rmtab() {
>     local rmtab_backup
>     if [ ${OCF_RESKEY_rmtab_backup} != "none" ]; then
>         rmtab_backup="${OCF_RESKEY_directory}/${OCF_RESKEY_rmtab_backup}"
>         grep ":${OCF_RESKEY_directory}:" /var/lib/nfs/rmtab > ${rmtab_backup}
>     fi
> }
>
> The problem was that the grep command was taking a long time, longer than
> any timeout I'd assigned to the resource. I looked at /var/lib/nfs/rmtab,
> and saw it was 60GB on one of my nodes and 16GB on the other.

Oops.
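
For anyone wanting to check whether their own rmtab has gone the same way,
a rough check with standard tools (nothing cluster-specific, just the same
/var/lib/nfs/rmtab the agent reads) would be:

  ls -lh /var/lib/nfs/rmtab            # how big the file has grown
  wc -l /var/lib/nfs/rmtab             # total entries
  sort -u /var/lib/nfs/rmtab | wc -l   # unique entries

A large gap between the last two numbers means duplicated client entries
are accumulating, which is exactly what makes that grep in backup_rmtab()
overrun the stop timeout.
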
> Since backup_rmtab() is called during the "stop" action, the resource
> could never successfully stop; it would always time out. This led to the
> state shown in the pastebins above: no amount of pacemaker resource
> restarting or fencing could fix the problem.
>
> My fixes were:
>
> - "rm /var/lib/nfs/rmtab; touch /var/lib/nfs/rmtab" on both nodes
>
> - going to all the directories I'd exported and deleting any .rmtab files
>   I found
>
> - adding the parameter 'rmtab_backup="none"' to all my exportfs resources.
>   I believe I can get away with this since I think all my clients are
>   mounting via NFSv4 and using automount to do it. I'll run some tests
>   with cluster failover while clients are mounting to be sure.

I recall there was some discussion about the merits of doing the rmtab
backup.

> The next question is: how did /var/lib/nfs/rmtab get so big? When I
> looked at the file, I saw the same two clients listed over and over
> again; I used those clients to mount an exported partition as a test.
> Somehow, with cluster failures and restarts, backup_rmtab() and
> restore_rmtab() in exportfs got into a loop in which those client entries
> were accumulated and never deleted.
>
> Perhaps this could be prevented in restore_rmtab() by replacing the line
>
> cat ${rmtab_backup} >> /var/lib/nfs/rmtab

That's really disastrous. I wonder how that got in.

> with something like
>
> cat ${rmtab_backup} /var/lib/nfs/rmtab | sort -u | cat - >/var/lib/nfs/rmtab

Looks better.

> I'll leave it to the experts to determine whether that would work, or if
> it's really necessary.

Thanks for the analysis.

Cheers,

Dejan

> --
> Bill Seligman             | Phone: (914) 591-2823
> Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
> PO Box 137                |
> Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/

_______________________________________________
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems