Re: [Linux-ha-dev] [PATCH] Filesystem RA: remove a status file only when OCF_CHECK_LEVEL is set as 20

Junko IKEDA Tue, 08 May 2012 19:29:25 -0700

Hi,

In my case, the umount succeeded when the Fibre Channel was disconnected,
so it seems that handling the status file caused the longer failover,
as Dejan said.
If the umount fails, the stop will run into a timeout and might trigger a
STONITH action; that case also makes sense (though I did not observe it).


I tried the following two setups:

(1) timeout : multipath > RA
multipath timeout = 120s
Filesystem RA stop timeout = 60s

(2) timeout : multipath < RA
multipath timeout = 60s
Filesystem RA stop timeout = 120s

Case (1): Filesystem_stop() fails. The hanging FC causes the stop to time out.

Case (2): Filesystem_stop() succeeds.
The filesystem is hanging, but lines 758 and 759 succeed (rc=0).
The status file is no longer accessible, so in fact it remains on the
filesystem.

> > 758 if [ -f "$STATUSFILE" ]; then
> > 759 rm -f ${STATUSFILE}
> > 760 if [ $? -ne 0 ]; then

So line 761 might not be called as expected.

> > 761 ocf_log warn "Failed to remove status file ${STATUSFILE}."
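
As a rough summary of cases (1) and (2) above, the outcome of the stop depends on which timeout expires first. The sketch below only encodes that observation; predict_stop_outcome is a hypothetical helper, not part of the Filesystem RA, and treating equal timeouts like case (1) is an assumption that was not tested.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Filesystem RA): summarizes the two
# observed cases. Arguments are timeouts in seconds.
predict_stop_outcome() {
    multipath_timeout=$1
    ra_stop_timeout=$2
    # Assumption: equal timeouts behave like case (1); this was not tested.
    if [ "$multipath_timeout" -ge "$ra_stop_timeout" ]; then
        # Case (1): I/O is still hanging when the RA stop timeout expires.
        echo "stop times out"
    else
        # Case (2): multipath gives up first, so the stop operation returns.
        echo "stop returns success"
    fi
}

predict_stop_outcome 120 60   # case (1)
predict_stop_outcome 60 120   # case (2)
```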


By the way, my concern is the unexpected stop timeout and the longer
failover time.
If OCF_CHECK_LEVEL is set to 20, it would be better to try to remove the
status file, just in case.
That handles case (2) if the user wants to recover from it with STONITH.
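
A minimal sketch of the idea, assuming the check is done inside the stop path: the status file is touched only when OCF_CHECK_LEVEL is 20, and skipped otherwise. The function name remove_status_file_if_deep_check is hypothetical, and echo stands in for ocf_log so that the snippet runs on its own; this is not the actual patch.

```shell
#!/bin/sh
# Hypothetical helper (illustration only, not the actual patch).
# Removes the status file only when OCF_CHECK_LEVEL is 20.
remove_status_file_if_deep_check() {
    statusfile=$1

    # An unset OCF_CHECK_LEVEL is treated as 0 here (assumption).
    if [ "${OCF_CHECK_LEVEL:-0}" != "20" ]; then
        return 0
    fi

    if [ -f "$statusfile" ]; then
        if ! rm -f "$statusfile"; then
            # The RA would call ocf_log warn; echo keeps this self-contained.
            echo "WARN: Failed to remove status file ${statusfile}." >&2
        fi
    fi
    return 0
}
```

With this shape, the default configuration never accesses the possibly hanging filesystem before the umount, while a user who sets OCF_CHECK_LEVEL to 20 keeps the existing removal attempt.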


Thanks,
Junko

2012/5/8 Dejan Muhamedagic <[email protected]>:
> Hi Lars,
>
> On Tue, May 08, 2012 at 01:35:16PM +0200, Lars Marowsky-Bree wrote:
>> On 2012-05-08T12:08:27, Dejan Muhamedagic <[email protected]> wrote:
>>
>> > > In the default case (without OCF_CHECK_LEVEL), it's enough to try to
>> > > unmount the file system, isn't it?
>> > > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
>> >
>> > I don't see a need to remove the STATUSFILE at all, as that may
>> > (as you observed) prevent the filesystem from stopping.
>> > Perhaps we should skip it altogether? If nobody objects, let's just
>> > remove this code:
>> >
>> >  758         if [ -f "$STATUSFILE" ]; then
>> >  759             rm -f ${STATUSFILE}
>> >  760             if [ $? -ne 0 ]; then
>> >  761                 ocf_log warn "Failed to remove status file ${STATUSFILE}."
>> >  762             fi
>> >  763         fi
>>
>> That would mean you can no longer differentiate between a "crash" and a
>> clean unmount.
>
> One could take a look at the logs. I guess that a crash would
> otherwise be noticeable as well :)
>
>> A hanging FC/SAN is likely to be unable to flush any other dirty buffers
>> as well, so the umount may not necessarily succeed w/o errors. I
>> think it's unreasonable to expect that the node will survive such a
>> scenario w/o recovery.
>
> True. However, in case of network attached storage or other
> transient errors it may lead to an unnecessary timeout followed
> by fencing, i.e. the chance for a longer failover time is higher.
> Just leaving a file around may not justify the risk.
>
> Junko-san, what was your experience?
>
> Cheers,
>
> Dejan
>
>> Regards,
>>     Lars
>>
>> --
>> Architect Storage/HA
>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, 
>> HRB 21284 (AG Nürnberg)
>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>
>> _______________________________________________________
>> Linux-HA-Dev: [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>> Home Page: http://linux-ha.org/