Ahh... I should have expanded "linux clients" to "linux clients running RHEL 3 U5".

[EMAIL PROTECTED] greg]$ rpm -qa util-linux
util-linux-2.11y-31.6

Red Hat support has this to say...

[...snip...]

"I have been looking into this issue and I have found other people are
experiencing similar behavior.  I also found a fix that was added to the
util-linux package that I think addresses this issue......

I believe this is what Chuck refers to with his comment "and I believe
later releases of RHEL 3 were fixed to do this"

From the upstream package change log:

"RHEL3 util-linux >=2.11y-31.8 should make the default 70s (instead of 7s)
for TCP mounts:

* Wed Jun 8 2005 Steve Dickson <[EMAIL PROTECTED]> 2.11y-31.8
- Changed nfsmount to retry calls to mountd in foreground as well as in
  background (bz# 138775)
- Increased TCP timeouts to 70 secs (bz# 151097)"

I am pretty sure this will fix the problem that you are seeing.  The
util-linux package in the Red Hat Enterprise Linux AS (v. 3 for x86) Beta
channel on RHN is version util-linux-2.11y-31.16.i386.rpm, which should
have this fix in it."

[...snip...]

The bug/errata

http://rhn.redhat.com/errata/RHBA-2005-626.html

became available in RHEL3 U6.  Sigh.  We skipped U3, U4 (autofs woes) U6
(just finished upgrading from U2->U5 and dealing with fallout) and
recently began using U7 (to support Sun x4100 SAS drives).

Thanks,

--Greg

Chuck Lever wrote:
> On 7/11/06, Gregory Baker <[EMAIL PROTECTED]> wrote:
>> We have thousands of linux clients hitting netapp file servers (many
>> 3500 series, clustered) on a local gigabit LAN.  From time to time,
>> applications return "file not found" when attempting to automount a
>> directory and access a file.  An example of this is a long running
>> process, which reads in data, processes it for hours (in which time the
>> filesystem is unmounted) then tries to read more data from that mount
>> point (which causes a "file not found" error in the application).  This
>> occurs about 1/100th of the time.
>>
>> Researching at Netapp turns up this bit by Chuck Lever (Linux NFS
>> contributor)
>>
>> "Using the Linux NFS Client with Network Appliance Filers"
>> http://www.netapp.com/library/tr/3183.pdf (February 2006)
>>
>> page 10 says...
>>
>> "Due to a bug in the mount command, the default retransmission timeout
>> value on Linux for NFS over TCP is quite small...To obtain standard
>> behavior, we strongly recommend using "timeo=600, retrans=2" explicitly
>> when mounting via TCP."
>>
>> Our defaults (assuming man pages are correct, RedHat Enterprise Linux 3)
>> would be timeo=7, retrans=3, which translates to 7+14+28+56 = 105 tenths
>> of a second (10 seconds).  It appears netapp is suggesting waiting
>> 600+600 = 1200 tenths (120 seconds) before giving up on the mount
>> command...
>
> It's important to distinguish two different types of timeouts.
>
> 1.  The mount operation has timed out.
>
> 2.  After the mount operation succeeds, an NFS RPC operation has timed out.
>
> TR-3183 discusses the proper settings for 2, but you are experiencing 1.
>
> The automounter attempts to mount one of the filer's exports, but the
> mount request times out causing the mounted-on directory to be
> exposed.  Your filer is heavily loaded, and the filer's mountd is
> single-threaded.  The filer may also be experiencing delays when
> requesting information from external servers (like DNS or NIS), in
> which case the mount request is held up at the filer.
>
> Both sides are at fault: the Linux mount command should retry (and I
> believe later releases of RHEL 3 were fixed to do this) and the filer
> configuration should be reviewed to make sure there are no avoidable
> delays while processing mount requests.
>
>> * What "bug" in the mount command do you believe NetApp is talking about?
>
> The bug is that the mount command overrides the proper default RPC
> timeout value with a timeout value of 0.7 seconds.
> This is *not* the timeout for mount operations, it is the timeout for
> the in-kernel NFS client to retransmit RPC requests.
>
>> * What do you think proper options for NFS auto/mounts would be for
>> extremely busy centralized NFS filers?
>
> If you are using NFS over TCP, the proper timeout value is 60 seconds.
>
>> * What is the reference standard behavior?
>
> Solaris, which is the NFSv3 reference implementation, uses effectively
> a 60 second timeout on TCP mounts.
>

-- 
----------------------------------------------------------------------
Greg Baker                        512-602-3287  (work)
[EMAIL PROTECTED]                 512-602-6970  (fax)
5900 E. Ben White Blvd MS 626     512-555-1212  (info)
Austin, TX 78741

_______________________________________________
autofs mailing list
[email protected]
http://linux.kernel.org/mailman/listinfo/autofs
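
For reference, the retransmission arithmetic in the quoted message
(timeo=7, retrans=3 giving 7+14+28+56 tenths of a second) can be sketched
as below. This only reproduces the sum as written there, assuming the
timeout doubles on every retransmission with no upper cap. The function
name is hypothetical, and the real client's backoff may differ.

```python
def total_wait_tenths(timeo, retrans):
    """Sum the initial timeout plus `retrans` retransmission timeouts.

    Assumption (taken from the arithmetic in the thread, not from the
    kernel source): each successive timeout is double the previous one,
    with no upper cap. Values are in tenths of a second, like timeo.
    """
    total = 0
    current = timeo
    for _ in range(retrans + 1):  # first attempt plus each retransmission
        total += current
        current *= 2
    return total


if __name__ == "__main__":
    tenths = total_wait_tenths(7, 3)
    print(tenths, "tenths of a second =", tenths / 10, "seconds")
```

With timeo=7 and retrans=3 this gives the same 105 tenths of a second
worked out by hand in the thread.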
