On 3/5/12 11:55 AM, William Seligman wrote:
> On 3/3/12 3:30 PM, William Seligman wrote:
>> On 3/3/12 2:14 PM, Florian Haas wrote:
>>> On Sat, Mar 3, 2012 at 6:55 PM, William Seligman
>>> <selig...@nevis.columbia.edu>  wrote:
>>>> On 3/3/12 12:03 PM, emmanuel segura wrote:
>>>>>
>>>>> are you sure the exportfs agent can be used with clone active/active?
>>>>
>>>> a) I've been through the script. If there's some problem associated with it
>>>> being cloned, I haven't seen it. (It can't handle globally-unique="true",
>>>> but I didn't turn that on.)
>>>
>>> It shouldn't have a problem with being cloned. Obviously, cloning that
>>> RA _really_ makes sense only with the export that manages an NFSv4
>>> virtual root (fsid=0). Otherwise, the export clone has to be hosted on
>>> a clustered filesystem, and you'd have to have a pNFS implementation
>>> that doesn't suck (tough to come by on Linux), and if you want that
>>> sort of replicated, parallel-access NFS you might as well use Gluster.
>>> The downside of the latter, though, is that it's currently NFSv3-only,
>>> without sideband locking.
>>
>> I'll look this over when I have a chance. I think I can get away without
>> an NFSv4 virtual root because I'm exporting everything to my cluster
>> either read-only, or only one system at a time will do any writing. Now
>> that you've warned me, I'll do some more checking.
>>
>>>> b) I had similar problems using the exportfs resource in a
>>>> primary-secondary setup without clones.
>>>>
>>>> Why would a resource being cloned create an ordering problem? I haven't set
>>>> the interleave parameter (even with the documentation I'm not sure what it
>>>> does) but A before B before C seems pretty clear, even for cloned 
>>>> resources.
>>>
>>> As for what interleave does: suppose you have two clones, A and B,
>>> and they're linked with an order constraint, like this:
>>>
>>> order A_before_B inf: A B
>>>
>>> ... then if interleave is false, _all_ instances of A must be started
>>> before _any_ instance of B gets to start anywhere in the cluster.
>>> However if interleave is true, then for any node only the _local_
>>> instance of A needs to be started before it can start the
>>> corresponding _local_ instance of B.
>>>
>>> In other words, interleave=true is actually the reasonable thing to
>>> set on all clone instances by default, and I believe the pengine
>>> actually does use a default of interleave=true on defined clone sets
>>> since some 1.1.x release (I don't recall which).
>>
>> Thanks, Florian. That's a great explanation. I'll probably stick
>> "interleave=true" on most of my clones just to make sure.
>>
>> It explains an error message I've seen in the logs:
>>
>> Mar  2 18:15:19 hypatia-tb pengine: [4414]: ERROR: clone_rsc_colocation_rh:
>> Cannot interleave clone ClusterIPClone and Gfs2Clone because they do not
>> support the same number of resources per node
>>
>> Because ClusterIPClone has globally-unique=true and clone-max=2, it's
>> possible for both instances to be running on a single node; I've seen this
>> a few times in my testing when cycling power on one of the nodes.
>> Interleaving doesn't make sense in such a case.
>>
>>> Bill, seeing as you've already pastebinned your config and crm_mon
>>> output, could you also pastebin your whole CIB as per "cibadmin -Q"
>>> output? Thanks.
>>
>> Sure: <http://pastebin.com/pjSJ79H6>. It doesn't have the exportfs
>> resources in it; I took them out before leaving for the weekend. If it
>> helps, I'll put them back in and try to get the "cibadmin -Q" output
>> before any nodes crash.
>>
> 
> For a test, I stuck in an exportfs resource with all the ordering constraints.
> Here's the "cibadmin -Q" output from that:
> 
> <http://pastebin.com/nugdufJc>
> 
> The output of crm_mon just after doing that, showing resource failure:
> 
> <http://pastebin.com/cyCFGUSD>
> 
> Then all the resources are stopped:
> 
> <http://pastebin.com/D62sGSrj>
> 
> A few seconds later one of the nodes is fenced, but this does not bring up
> anything:
> 
> <http://pastebin.com/wzbmfVas>

I believe I have the solution to my stability problem. It doesn't solve the
issue of ordering, but I think I have a configuration that will survive 
failover.

Here's the problem. I had exportfs resources such as:

primitive ExportUsrNevis ocf:heartbeat:exportfs \
        op start interval="0" timeout="40" \
        op stop interval="0" timeout="45" \
        params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" \
        fsid="20" options="ro,no_root_squash,async"

I did detailed traces of the execution of exportfs (putting in logger commands)
and found that the problem was in the backup_rmtab function in exportfs:

backup_rmtab() {
    local rmtab_backup
    if [ ${OCF_RESKEY_rmtab_backup} != "none" ]; then
        rmtab_backup="${OCF_RESKEY_directory}/${OCF_RESKEY_rmtab_backup}"
        grep ":${OCF_RESKEY_directory}:" /var/lib/nfs/rmtab > ${rmtab_backup}
    fi
}
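
(If anyone wants to reproduce the tracing, it's nothing fancy; a couple of
logger lines around the grep, something along these lines, will show where the
time goes. This is a sketch, not my exact instrumentation.)

   logger -t exportfs "backup_rmtab: grepping rmtab for ${OCF_RESKEY_directory}"
   grep ":${OCF_RESKEY_directory}:" /var/lib/nfs/rmtab > ${rmtab_backup}
   logger -t exportfs "backup_rmtab: grep done"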

The problem was that the grep command was taking a long time, longer than any
timeout I'd assigned to the resource. I looked at /var/lib/nfs/rmtab, and saw it
was 60GB on one of my nodes and 16GB on the other. Since backup_rmtab() is
called during the "stop" action, the resource could never successfully stop; it
would always time out. This led to the state shown in the pastebins above: no
amount of pacemaker resource restarting or fencing could fix the problem.
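
(To check whether you're hitting the same problem, the size of the file and the
time the RA's grep takes tell the story. The directory here matches my setup;
adjust it for yours.)

   ls -lh /var/lib/nfs/rmtab
   time grep -c ":/usr/nevis:" /var/lib/nfs/rmtab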

My fixes were:

- "rm /var/lib/nfs/rmtab; touch /var/lib/nfs/rmtab" on both the nodes

- going to all the directories I'd exported and deleting any .rmtab files I found

- adding the parameter 'rmtab_backup="none"' to all my exportfs resources (see
the sketch after this list). I believe I can get away with this since I think
all my clients are mounting via NFSv4 and using automount to do it. I'll run
some tests with cluster failover while clients are mounting to be sure.
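
For reference, each exportfs primitive now carries the extra parameter, along
these lines:

primitive ExportUsrNevis ocf:heartbeat:exportfs \
        op start interval="0" timeout="40" \
        op stop interval="0" timeout="45" \
        params clientspec="*.nevis.columbia.edu" directory="/usr/nevis" \
        fsid="20" options="ro,no_root_squash,async" \
        rmtab_backup="none"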

The next question is: how did /var/lib/nfs/rmtab get so big? When I looked at
the file, I saw the same two clients listed over and over again; I used those
clients to mount an exported partition as a test. Somehow, with cluster failures
and restarts, backup_rmtab() and restore_rmtab() in exportfs got into a loop in
which those client entries were accumulated and never deleted.
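
I haven't pinned down the exact sequence of events, but it's easy to see in
isolation how an append-on-restore cycle snowballs if nothing ever prunes the
file (made-up paths, purely an illustration):

   echo "client.example.com:/usr/nevis:0x00000001" > /tmp/rmtab
   for cycle in 1 2 3 4; do
       grep ":/usr/nevis:" /tmp/rmtab > /tmp/rmtab.backup  # what backup_rmtab does
       cat /tmp/rmtab.backup >> /tmp/rmtab                 # what restore_rmtab does
   done
   wc -l < /tmp/rmtab   # 16 copies of the one original entry after 4 cycles

Double the file on every stop/start cycle and 60GB doesn't take long.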

Perhaps this could be prevented in restore_rmtab() by replacing the line

   cat ${rmtab_backup} >> /var/lib/nfs/rmtab

with something like

   cat ${rmtab_backup} /var/lib/nfs/rmtab | sort -u | cat - >/var/lib/nfs/rmtab

I'll leave it to the experts to determine whether that would work, or if it's
really necessary.
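
(A variant that goes through a temporary file, as sketched below, would at
least sidestep any worry about the shell truncating /var/lib/nfs/rmtab before
the first cat has finished reading it. Again, just a suggestion for the experts
to weigh in on.)

   sort -u ${rmtab_backup} /var/lib/nfs/rmtab > /var/lib/nfs/rmtab.tmp &&
       mv /var/lib/nfs/rmtab.tmp /var/lib/nfs/rmtab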
-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://selig...@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
