Ahh... I should have expanded "linux clients" to "linux clients running 
RHEL 3 U5".

[EMAIL PROTECTED] greg]$ rpm -qa util-linux
util-linux-2.11y-31.6

Red Hat support has this to say...

[...snip...]

"I have been looking into this issue and I have found other people are 
experiencing similar behavior.  I also found a fix that was added to the 
util-linux package that I think addresses this issue.  I believe 
this is what Chuck refers to with his comment "and I believe later 
releases of RHEL 3 were fixed to do this".

From the upstream package change log:

"RHEL3 util-linux >=2.11y-31.8 should make the default 70s (instead of 
7s) for TCP mounts:

* Wed Jun  8 2005 Steve Dickson <[EMAIL PROTECTED]> 2.11y-31.8
- Changed nfsmount to retry calls to mountd in foreground as
  well as in background (bz# 138775)
- Increased TCP timeouts to 70 secs (bz# 151097)"

I am pretty sure this will fix the problem that you are seeing.  The 
util-linux package in the Red Hat Enterprise Linux AS (v. 3 for x86) 
Beta channel on RHN is version util-linux-2.11y-31.16.i386.rpm, which 
should have this fix in it."

[...snip...]

The bug/errata

http://rhn.redhat.com/errata/RHBA-2005-626.html

became available in RHEL3 U6.  Sigh.
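
For anyone wanting to check a batch of clients, here is a rough Python
sketch (mine, not something from Red Hat) that tests whether the string
printed by "rpm -qa util-linux" is at or past the 2.11y-31.8 release
named in the change log above.  The function names are made up for
illustration, and the comparison is a simplification that only handles
the 2.11y line; it is not a full reimplementation of RPM's version
comparison.

```python
import re

# Assumptions taken from the change log quoted above.
FIXED_VERSION = "2.11y"   # util-linux version shipped in RHEL 3
FIXED_RELEASE = "31.8"    # first release listing the 70s TCP timeout change


def release_key(release):
    """Turn a release string such as '31.16' into a tuple of ints: (31, 16)."""
    return tuple(int(part) for part in re.findall(r"\d+", release))


def has_tcp_timeout_fix(package):
    """Return True if the package is util-linux 2.11y-31.8 or later.

    `package` is the name-version-release string printed by rpm,
    e.g. 'util-linux-2.11y-31.6'.  Anything that is not a 2.11y
    util-linux package raises ValueError instead of guessing.
    """
    name, version, release = package.rsplit("-", 2)
    if name != "util-linux" or version != FIXED_VERSION:
        raise ValueError("unexpected package: %s" % package)
    return release_key(release) >= release_key(FIXED_RELEASE)


if __name__ == "__main__":
    for pkg in ("util-linux-2.11y-31.6", "util-linux-2.11y-31.16"):
        print(pkg, has_tcp_timeout_fix(pkg))
```

Comparing the release as a tuple of integers matters here because a
plain string comparison would rank 31.16 below 31.8.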

We skipped U3 and U4 (autofs woes) and U6 (we had just finished 
upgrading from U2 to U5 and were dealing with the fallout), and 
recently began using U7 (to support Sun x4100 SAS drives).

Thanks,

--Greg

Chuck Lever wrote:
> On 7/11/06, Gregory Baker <[EMAIL PROTECTED]> wrote:
>> We have thousands of linux clients hitting netapp file servers (many
>> 3500 series, clustered) on a local gigabit LAN.  From time to time,
>> applications return "file not found" when attempting to automount a
>> directory and access a file.  An example of this is a long running
>> process, which reads in data, processes it for hours (in which time the
>> filesystem is unmounted) then tries to read more data from that mount
>> point (which causes a "file not found" error in the application).  This
>> occurs about 1/100th of the time.
>>
>> Researching at Netapp turns up this bit by Chuck Lever (Linux NFS
>> contributor)
>>
>> "Using the Linux NFS Client with Network Appliance Filers"
>> http://www.netapp.com/library/tr/3183.pdf  (February 2006)
>>
>> page 10 says...
>>
>> "Due to a bug in the mount command, the default retransmission timeout
>> value on Linux for NFS over TCP is quite small...To obtain standard
>> behavior, we strongly recommend using "timeo=600, retrans=2" explicitly
>> when mounting via TCP."
>>
>> Our defaults (assuming man pages are correct, RedHat Enterprise Linux 3)
>> would be timeo=7, retrans=3, which translates to 7+14+28+56 = 105 tenths
>> of a second (about 10.5 seconds).  It appears netapp is suggesting waiting
>> 600+600 = 1200 tenths (120 seconds) before giving up on the mount 
>> command...
> 
> It's important to distinguish two different types of timeouts.
> 
> 1.  The mount operation has timed out.
> 
> 2.  After the mount operation succeeds, an NFS RPC operation has timed out.
> 
> TR-3183 discusses the proper settings for 2, but you are experiencing 1.
> 
> The automounter attempts to mount one of the filer's exports, but the
> mount request times out causing the mounted-on directory to be
> exposed.  Your filer is heavily loaded, and the filer's mountd is
> single-threaded.  The filer may also be experiencing delays when
> requesting information from external servers (like DNS or NIS), in
> which case the mount request is held up at the filer.
> 
> Both sides are at fault:  the Linux mount command should retry (and I
> believe later releases of RHEL 3 were fixed to do this) and the filer
> configuration should be reviewed to make sure there are no avoidable
> delays while processing mount requests.
> 
>> * What "bug" in the mount command do you believe NetApp is talking about?
> 
> The bug is that the mount command overrides the proper default RPC
> timeout value with a timeout value of 0.7 seconds.  This is *not* the
> timeout for mount operations, it is the timeout for the in-kernel NFS
> client to retransmit RPC requests.
> 
>> * What do you think proper options for NFS auto/mounts would be for
>> extremely busy centralized NFS filers?
> 
> If you are using NFS over TCP, the proper timeout value is 60 seconds.
> 
>> * What is the reference standard behavior?
> 
> Solaris, which is the NFSv3 reference implementation, uses effectively
> a 60 second timeout on TCP mounts.
> 

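For reference, the timeo/retrans arithmetic from my original message
can be reproduced with a few lines of Python.  This is only a sketch of
the doubling model I assumed from the man page (initial timeout, then
each retransmission waits twice as long); the function name is made up,
and per Chuck's reply these values govern RPC retransmission after the
mount succeeds, not the mount operation itself.

```python
def total_wait_tenths(timeo, retrans):
    """Sum the initial timeout and `retrans` retries, doubling each time.

    `timeo` is in tenths of a second, as in the NFS mount option.
    The doubling between attempts is an assumption taken from the
    arithmetic in the quoted message, not from the kernel source.
    """
    return sum(timeo * 2 ** attempt for attempt in range(retrans + 1))


if __name__ == "__main__":
    tenths = total_wait_tenths(timeo=7, retrans=3)   # 7 + 14 + 28 + 56
    print(tenths, "tenths =", tenths / 10, "seconds")
```
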
-- 
----------------------------------------------------------------------
Greg Baker                                         512-602-3287 (work)
[EMAIL PROTECTED]                              512-602-6970 (fax)
5900 E. Ben White Blvd MS 626                      512-555-1212 (info)
Austin, TX 78741



_______________________________________________
autofs mailing list
[email protected]
http://linux.kernel.org/mailman/listinfo/autofs
