Hi,

Is my case hard to understand?
By "multipath" I mean the Fibre Channel paths; there are two cables for redundancy.

Thanks,
Junko

2012/5/9 Junko IKEDA <tsukishima...@gmail.com>:
> Hi,
>
> In my case, the umount succeeded when the Fibre Channel was disconnected,
> so it seemed that the handling of the status file caused a longer failover,
> as Dejan said.
> If the umount fails, it will run into a timeout and might trigger a stonith
> action, and this case also makes sense (though I couldn't see this).
>
> I tried the following setups:
>
> (1) timeout : multipath > RA
>     multipath timeout = 120s
>     Filesystem RA stop timeout = 60s
>
> (2) timeout : multipath < RA
>     multipath timeout = 60s
>     Filesystem RA stop timeout = 120s
>
> In case (1), Filesystem_stop() fails. The hanging FC causes the stop timeout.
>
> In case (2), Filesystem_stop() succeeds.
> The filesystem is hanging, but lines 758 and 759 succeed (rc=0).
> The status file is no longer accessible, so in fact it remains on the
> filesystem.
>
>> > 758 if [ -f "$STATUSFILE" ]; then
>> > 759     rm -f ${STATUSFILE}
>> > 760     if [ $? -ne 0 ]; then
>
> So line 761 might not be called as expected.
>
>> > 761         ocf_log warn "Failed to remove status file ${STATUSFILE}."
>
>
> By the way, my concern is the unexpected stop timeout and the longer
> failover time.
> If OCF_CHECK_LEVEL is set to 20, it would be better to try to remove the
> status file just in case.
> That can handle case (2) if the user wants to recover from it with
> STONITH.
>
>
> Thanks,
> Junko
>
> 2012/5/8 Dejan Muhamedagic <de...@suse.de>:
>> Hi Lars,
>>
>> On Tue, May 08, 2012 at 01:35:16PM +0200, Lars Marowsky-Bree wrote:
>>> On 2012-05-08T12:08:27, Dejan Muhamedagic <de...@suse.de> wrote:
>>>
>>> > > In the default (without OCF_CHECK_LEVEL), it's enough to try to unmount
>>> > > the file system, isn't it?
>>> > > https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/Filesystem#L774
>>> >
>>> > I don't see a need to remove the STATUSFILE at all, as that may
>>> > (and as you observed it) prevent the filesystem from stopping.
>>> > Perhaps to skip it altogether? If nobody objects let's just
>>> > remove this code:
>>> >
>>> > 758 if [ -f "$STATUSFILE" ]; then
>>> > 759     rm -f ${STATUSFILE}
>>> > 760     if [ $? -ne 0 ]; then
>>> > 761         ocf_log warn "Failed to remove status file ${STATUSFILE}."
>>> > 762     fi
>>> > 763 fi
>>>
>>> That would mean you can no longer differentiate between a "crash" and a
>>> clean unmount.
>>
>> One could take a look at the logs. I guess that a crash would
>> otherwise be noticeable as well :)
>>
>>> A hanging FC/SAN is likely to be unable to flush any other dirty buffers
>>> too, so the umount may not necessarily succeed w/o errors. I
>>> think it's unreasonable to expect that the node will survive such a
>>> scenario w/o recovery.
>>
>> True. However, in case of network attached storage or other
>> transient errors it may lead to an unnecessary timeout followed
>> by fencing, i.e. the chance for a longer failover time is higher.
>> Just leaving a file around may not justify the risk.
>>
>> Junko-san, what was your experience?
>>
>> Cheers,
>>
>> Dejan
>>
>>> Regards,
>>> Lars
>>>
>>> --
>>> Architect Storage/HA
>>> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer,
>>> HRB 21284 (AG Nürnberg)
>>> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>>>
>>> _______________________________________________________
>>> Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
>>> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
>>> Home Page: http://linux-ha.org/
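
For illustration only, the idea raised in this thread (attempt the status file removal only when OCF_CHECK_LEVEL is 20, and never let it fail the stop) might be sketched as below. This is a rough sketch under assumptions, not the Filesystem RA's actual code: the function name maybe_remove_status_file is hypothetical, it assumes OCF_CHECK_LEVEL is visible in the environment of the stop action, and it uses a plain echo to stderr in place of ocf_log so that it runs standalone. It does nothing to protect against rm itself blocking on a hung device.

```shell
#!/bin/sh
# Hypothetical helper, not part of the Filesystem RA.
# Removes the status file only when OCF_CHECK_LEVEL is 20.
# Assumption: OCF_CHECK_LEVEL is available in the environment here.
maybe_remove_status_file() {
    statusfile=$1

    # At any other check level, leave the file alone.
    [ "${OCF_CHECK_LEVEL:-0}" = "20" ] || return 0

    # Nothing to do if the file is absent (or cannot be seen).
    [ -f "$statusfile" ] || return 0

    if ! rm -f -- "$statusfile"; then
        # Stand-in for: ocf_log warn "Failed to remove status file ..."
        echo "WARN: Failed to remove status file $statusfile." >&2
    fi

    # A leftover status file is only warned about; it never fails the stop.
    return 0
}
```

Called with the status file path just before the unmount attempt, this keeps the default path (no OCF_CHECK_LEVEL) down to the umount alone.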