Re: [Linux-ha-dev] [PATCH] Filesystem RA: remove a status file only when OCF_CHECK_LEVEL is set as 20

Junko IKEDA Sun, 13 May 2012 17:03:21 -0700

Hi,

Is my case hard to understand?
Here "multipath" refers to the Fibre Channel connection, which uses two cables
for redundancy.


Thanks,
Junko
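
For reference, the idea in the subject line could be sketched roughly as
follows. This is only an illustration under some assumptions, not the actual
patch: remove_status_file and log_warn are made-up names (the RA itself logs
through ocf_log), and the sketch assumes OCF_CHECK_LEVEL and STATUSFILE are
already set in the environment of the stop action.

```shell
#!/bin/sh
# Sketch only: remove_status_file and log_warn are hypothetical names,
# not functions from the Filesystem RA.

log_warn() {
    echo "WARN: $*" >&2
}

# Try to remove the status file only when the deep check (level 20) is
# configured. For any other level, skip the removal so that the stop
# path goes straight to umount.
remove_status_file() {
    # Treat an unset or empty OCF_CHECK_LEVEL as 0.
    [ "${OCF_CHECK_LEVEL:-0}" = "20" ] || return 0

    if [ -f "$STATUSFILE" ]; then
        if ! rm -f "$STATUSFILE"; then
            log_warn "Failed to remove status file ${STATUSFILE}."
        fi
    fi
    return 0
}
```

With this guard, a configuration that does not use OCF_CHECK_LEVEL=20 never
touches the status file during stop, while a level-20 configuration still
attempts the cleanup before the umount.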

2012/5/9 Junko IKEDA <[email protected]>:
> Hi,
>
> In my case, the umount succeeded when the Fibre Channel was disconnected,
> so it seemed that handling the status file caused the longer failover,
> as Dejan said.
> If the umount fails, it will run into a timeout and might trigger a STONITH
> action, and that case also makes sense (though I couldn't observe it).
>
> I tried the following two setups:
>
> (1) timeout : multipath > RA
> multipath timeout = 120s
> Filesystem RA stop timeout = 60s
>
> (2) timeout : multipath < RA
> multipath timeout = 60s
> Filesystem RA stop timeout = 120s
>
> Case (1): Filesystem_stop() fails. The hanging FC causes the stop timeout.
>
> Case (2): Filesystem_stop() succeeds.
> The filesystem is hanging, but lines 758 and 759 succeed (rc=0).
> The status file is no longer accessible, so in fact it remains on the
> filesystem.
>
>> > 758 if [ -f "$STATUSFILE" ]; then
>> > 759 rm -f ${STATUSFILE}
>> > 760 if [ $? -ne 0 ]; then
>
> So line 761 might not be called as expected.
>
>> > 761 ocf_log warn "Failed to remove status file ${STATUSFILE}."
>
>
> By the way, my concern is the unexpected stop timeout and the longer
> failover time.
> If OCF_CHECK_LEVEL is set to 20, it would be better to try to remove the
> status file just in case.
> That handles case (2) if the user wants to recover from it with
> STONITH.
>
>
> Thanks,
> Junko
>
> 2012/5/8 Dejan Muhamedagic <[email protected]>:
>> Hi Lars,
>>
>> On Tue, May 08, 2012 at 01:35:16PM +0200, Lars Marowsky-Bree wrote:
>>> On 2012-05-08T12:08:27, Dejan Muhamedagic <[email protected]> wrote:
>>>
>>> > > In the default (without OCF_CHECK_LEVEL), it's enough to try to unmount
>>> > > the file system, isn't it?
>>> > > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
>>> >
>>> > I don't see a need to remove the STATUSFILE at all, as that may
>>> > (as you observed) prevent the filesystem from stopping.
>>> > Perhaps skip it altogether? If nobody objects let's just
>>> > remove this code:
>>> >
>>> >  758         if [ -f "$STATUSFILE" ]; then
>>> >  759             rm -f ${STATUSFILE}
>>> >  760             if [ $? -ne 0 ]; then
>>> >  761                 ocf_log warn "Failed to remove status file ${STATUSFILE}."
>>> >  762             fi
>>> >  763         fi
>>>
>>> That would mean you can no longer differentiate between a "crash" and a
>>> clean unmount.
>>
>> One could take a look at the logs. I guess that a crash would
>> otherwise be noticeable as well :)
>>
>>> A hanging FC/SAN is likely to be unable to flush any other dirty buffers
>>> as well, so the umount may not necessarily succeed w/o errors. I
>>> think it's unreasonable to expect that the node will survive such a
>>> scenario w/o recovery.
>>
>> True. However, in case of network attached storage or other
>> transient errors it may lead to an unnecessary timeout followed
>> by fencing, i.e. the chance for a longer failover time is higher.
>> Just leaving a file around may not justify the risk.
>>
>> Junko-san, what was your experience?
>>
>> Cheers,
>>
>> Dejan
>>
>>> Regards,
>>>     Lars
>>>
>>> --
>>> Architect Storage/HA
>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
>>> HRB 21284 (AG Nürnberg)
>>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>>
>>> _______________________________________________________
>>> Linux-HA-Dev: [email protected]
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>>> Home Page: http://linux-ha.org/
