Hi everybody,
In my case (very similar to Junko's), when I disconnect the Fibre Channel
links, the "try_umount" procedure in the Filesystem RA script does not work.

After the configured number of attempts the active/passive cluster does not
fail over, and the lvmdir resource is flagged as "failed" rather than "stopped".

I should add that even unmounting /storage manually fails, because Sybase
keeps some files on it open (device busy); this is why the RA cannot complete
the stop cleanly. Is there a way to force the failover anyway?

A few notes on what I have already tried:
1) The same test with a different optical SAN/storage in the past; the RA
could always unmount the storage correctly.
2) I modified the RA to force "umount -l" even though the filesystem is
ext4 rather than NFS.
3) I killed the hung processes with "fuser -km /storage", but the umount
still failed, and after a while I got a kernel panic.

Is there a way to force the failover anyway, even if the umount is not clean?
Any suggestions?

Thanks for your time,
Regards
Guglielmo

P.S. lvmdir resource configuration

<primitive class="ocf" id="resource_lvmdir" provider="heartbeat"
type="Filesystem">
          <instance_attributes id="resource_lvmdir-instance_attributes">
            <nvpair id="resource_lvmdir-instance_attributes-device"
name="device" value="/dev/VG_SDG_Cluster_RM/LV_SDG_Cluster_RM"/>
            <nvpair id="resource_lvmdir-instance_attributes-directory"
name="directory" value="/storage"/>
            <nvpair id="resource_lvmdir-instance_attributes-fstype"
name="fstype" value="ext4"/>
          </instance_attributes>
          <meta_attributes id="resource_lvmdir-meta_attributes">
            <nvpair id="resource_lvmdir-meta_attributes-multiple-active"
name="multiple-active" value="stop_start"/>
            <nvpair id="resource_lvmdir-meta_attributes-migration-threshold"
name="migration-threshold" value="1"/>
            <nvpair id="resource_lvmdir-meta_attributes-failure-timeout"
name="failure-timeout" value="0"/>
          </meta_attributes>
          <operations>
            <op enabled="true" id="resource_lvmdir-startup" interval="60s"
name="monitor" on-fail="restart" requires="nothing" timeout="40s"/>
            <op id="resource_lvmdir-start-0" interval="0" name="start"
on-fail="restart" requires="nothing" timeout="180s"/>
            <op id="resource_lvmdir-stop-0" interval="0" name="stop"
on-fail="restart" requires="nothing" timeout="180s"/>
          </operations>
</primitive>
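
One thing I am considering (this is my reading of the Pacemaker
documentation, so please correct me if I am wrong): with STONITH configured,
setting on-fail="fence" on the stop operation should make a failed or
timed-out stop escalate to fencing the node, so the resource can move to the
other node even when the umount is not clean. The stop op above would become:

```xml
<op id="resource_lvmdir-stop-0" interval="0" name="stop"
    on-fail="fence" requires="nothing" timeout="180s"/>
```

With on-fail="restart" on stop, as configured now, I suspect the cluster just
keeps the resource marked "failed" instead of fencing.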

2012/5/9 Junko IKEDA <tsukishima...@gmail.com>:
> Hi,
>
> In my case, the umount succeeds when the Fibre Channel is
> disconnected, so it seems that the handling of the status file caused the
> longer failover, as Dejan said.
> If the umount fails, it will run into a timeout and might trigger a STONITH
> action; that case also makes sense (though I couldn't observe it).
>
> I tried the following setups:
>
> (1) timeout : multipath > RA
> multipath timeout = 120s
> Filesystem RA stop timeout = 60s
>
> (2) timeout : multipath < RA
> multipath timeout = 60s
> Filesystem RA stop timeout = 120s
>
> In case (1), Filesystem_stop() fails: the hanging FC causes the stop timeout.
>
> In case (2), Filesystem_stop() succeeds.
> The filesystem is hanging, but lines 758 and 759 succeed (rc=0),
> so the status file in fact remains on the filesystem.
>
>> > 758 if [ -f "$STATUSFILE" ]; then
>> > 759 rm -f ${STATUSFILE}
>> > 760 if [ $? -ne 0 ]; then
>
> So line 761 might not be reached as expected.
>
>> > 761 ocf_log warn "Failed to remove status file ${STATUSFILE}."
>
>
> By the way, my concern is the unexpected stop timeout and the longer
> failover time. If OCF_CHECK_LEVEL is set to 20, it would be better to
> try to remove the status file just in case.
> That would also handle case (2), if the user wants to recover from it
> with STONITH.
>
>
> Thanks,
> Junko
>
> 2012/5/8 Dejan Muhamedagic <de...@suse.de>:
>> Hi Lars,
>>
>> On Tue, May 08, 2012 at 01:35:16PM +0200, Lars Marowsky-Bree wrote:
>>> On 2012-05-08T12:08:27, Dejan Muhamedagic <de...@suse.de> wrote:
>>>
>>> > > In the default case (without OCF_CHECK_LEVEL), it's enough to try to
>>> > > unmount the file system, isn't it?
>>> > > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
>>> >
>>> > I don't see a need to remove the STATUSFILE at all, as that may
>>> > (and as you observed) prevent the filesystem from stopping.
>>> > Perhaps we should skip it altogether? If nobody objects, let's just
>>> > remove this code:
>>> >
>>> >  758         if [ -f "$STATUSFILE" ]; then
>>> >  759             rm -f ${STATUSFILE}
>>> >  760             if [ $? -ne 0 ]; then
>>> >  761                 ocf_log warn "Failed to remove status file ${STATUSFILE}."
>>> >  762             fi
>>> >  763         fi
>>>
>>> That would mean you can no longer differentiate between a "crash" 
>>> and a clean unmount.
>>
>> One could take a look at the logs. I guess that a crash would 
>> otherwise be noticeable as well :)
>>
>>> A hanging FC/SAN is likely to be unable to flush any other dirty
>>> buffers either, so the umount may not necessarily succeed w/o
>>> errors. I think it's unreasonable to expect that the node will
>>> survive such a scenario w/o recovery.
>>
>> True. However, in case of network attached storage or other transient 
>> errors it may lead to an unnecessary timeout followed by fencing, 
>> i.e. the chance for a longer failover time is higher.
>> Just leaving a file around may not justify the risk.
>>
>> Junko-san, what was your experience?
>>
>> Cheers,
>>
>> Dejan
>>
>>> Regards,
>>>     Lars
>>>
>>> --
>>> Architect Storage/HA
>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix 
>>> Imendörffer, HRB 21284 (AG Nürnberg) "Experience is the name 
>>> everyone gives to their mistakes." -- Oscar Wilde
>>>
>>> _______________________________________________________
>>> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org 
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>>> Home Page: http://linux-ha.org/