We have the same problem with our file servers when they become overloaded or get rebooted. There seem to be at least two issues that need to be dealt with. The first issue is that mount/mount.nfs does not seem to work as advertised. Here is an example of a mount failing within 3 seconds:

# time mount lidx:/export /mnt
mount.nfs: mount to NFS server 'lidx:/export' failed: System Error: No route to host
0.000u 0.004s 0:03.00 0.0% 0+0k 0+0io 0pf+0w

man nfs says:

    retry=n    The number of minutes that the mount(8) command retries
               an NFS mount operation in the foreground or background
               before giving up. If this option is not specified, the
               default value for foreground mounts is 2 minutes, and the
               default value for background mounts is 10000 minutes
               (80 minutes shy of one week)...

So the mount gives up after 3 seconds instead of 2 minutes if the host is on the same network.

The second issue is that the -hosts mount does not seem to want to wait very long either. Here is an example of a /net mount. The -hosts option seems to wait only 3 seconds and then caches the negative entry.

# grep ^/net /etc/auto.master
/net -hosts

Here it is taking 3 seconds, and the second attempt fails immediately:

# time ls /net/google.com
ls: /net/google.com: No such file or directory
0.000u 0.000s 0:03.03 0.0% 0+0k 24+0io 0pf+0w

# time ls /net/google.com
ls: /net/google.com: No such file or directory
0.000u 0.004s 0:00.00 0.0% 0+0k 0+0io 0pf+0w

# time showmount -e google.com
portmap getport: RPC: Timed out
0.000u 0.000s 3:12.01 0.0% 0+0k 40+0io 0pf+0w

I found this to be fixable with the following patch, which changes the UDP timeout from 3 seconds to 30 and the TCP timeout from 5 seconds to 50. I am not sure what a good value would be, but our file servers can take a couple of minutes to reboot.

diff --git a/include/rpc_subs.h b/include/rpc_subs.h
index 87fd568..d3c1d9f 100644
--- a/include/rpc_subs.h
+++ b/include/rpc_subs.h
@@ -39,8 +39,8 @@
 #define RPC_CLOSE_ACTIVE RPC_CLOSE_DEFAULT
 #define RPC_CLOSE_NOLINGER 0x0001
-#define PMAP_TOUT_UDP 3
-#define PMAP_TOUT_TCP 5
+#define PMAP_TOUT_UDP 30
+#define PMAP_TOUT_TCP 50

Program mounts seem to wait long enough.
You can write something like the auto.net script, which could be modified to wait for showmount to get something back, or to retry on failures, so that if a file server is down it might still give automount something to mount.

I have also found that /usr/sbin/showmount, which can be used by /etc/auto.net, has changed its behavior at some point. It used to take 3 minutes+ to time out:

% time showmount -e 10.1.1.1
portmap getport: RPC: Timed out
0.000u 0.004s 3:12.00 0.0% 0+0k 40+0io 0pf+0w
% rpm -qf /usr/sbin/showmount
nfs-client-1.1.0-11

Now a newer version seems to give up in 13 seconds for some reason...

% time showmount -e 10.1.1.1
showmount: RPC: Timed out
0.000u 0.004s 0:13.00 0.0% 0+0k 0+0io 0pf+0w
% rpm -qf /usr/sbin/showmount
nfs-client-1.1.3-14.1

So to summarize, mount failures due to timeouts seem to have at least 5 different values in my environment. The range seems to be from 3 seconds to more than 3 minutes, depending on the type of map and the binaries used.

The following automount entry has a timeout which is determined by /usr/sbin/automount and seems to be 3 seconds if the host is unreachable:

/nethosts -hosts

The following entry, /etc/auto.net, will fail if showmount or kshowmount times out. This seems to vary between 13 seconds and 3+ minutes depending on the version of showmount:

/netauto /etc/auto.net

A plain file timeout seems to be controlled by the version of mount/mount.nfs. My version of mount.nfs seems to wait 31 seconds before giving up if the host is unreachable on a remote network, but if the machine is on the same network mount.nfs times out after 3 seconds with a "No route to host" message.
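
As a rough illustration of the retry idea for an auto.net-style program map mentioned above, here is a minimal sketch. The function name retry_cmd and the attempt count and delay in the usage comment are made up for this example and are not part of autofs or nfs-utils. It also assumes automount is willing to wait that long for the program map to return, which I have not verified.

```shell
#!/bin/sh
# Sketch only: retry a command a fixed number of times with a pause
# between attempts. retry_cmd is a hypothetical helper, not something
# shipped with autofs or nfs-utils.

# usage: retry_cmd ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_cmd() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Run the command; on success pass its output through and stop.
        if out=$("$@" 2>/dev/null); then
            printf '%s\n' "$out"
            return 0
        fi
        # Pause before the next attempt, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    # Every attempt failed.
    return 1
}

# Possible use inside a program map such as /etc/auto.net, where $1 is
# the key (host name) passed in by automount. The values 10 and 15 are
# arbitrary examples:
#   retry_cmd 10 15 showmount -e "$1"
```

With something like this the total wait is bounded by the attempt count and delay chosen in the script, rather than by whichever timeout the installed showmount happens to use.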

The two mount.nfs cases, remote network first and then local network:

# mount.nfs 10.1.1.1:/export /mnt
mount.nfs: mount to NFS server '10.1.1.1:/export' failed: timed out, giving up
0.00user 0.00system 0:31.02elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+8outputs (0major+197minor)pagefaults 0swaps

# mount.nfs 192.168.1.222:/export /mnt
mount.nfs: mount to NFS server '192.168.1.222:/export' failed: System Error: No route to host
0.00user 0.00system 0:03.00elapsed 0%CPU (0avgtext+0avgdata 0maxresid

Here is some output showing 4 of the results:

% time ls /nethosts/10.1.1.1
ls: cannot access /nethosts/10.1.1.1: No such file or directory
0.000u 0.000s 0:03.00 0.0% 0+0k 0+0io 0pf+0w

% time ls /netauto/10.1.1.1
ls: cannot access /netauto/10.1.1.1: No such file or directory
0.000u 0.000s 0:13.01 0.0% 0+0k 0+0io 0pf+0w

% time ls /netfile/10.1.1.1
ls: cannot open directory /netfile/10.1.1.1: No such file or directory
0.000u 0.004s 0:31.03 0.0% 0+0k 0+0io 0pf+0w

And the most patient version is a machine with an earlier version of showmount:

% time ls /netauto/10.1.1.1
ls: /netauto/10.1.1.1: No such file or directory
0.004u 0.000s 3:12.04 0.0% 0+0k 0+0io 0pf+0w

steve

On Mon, Jul 6, 2009 at 6:46 PM, Ian Kent<[email protected]> wrote:
> Filipe Brandenburger wrote:
>> Hi Ian,
>>
>> Ian Kent wrote:
>>> Filipe Brandenburger wrote:
>>>> Recently I had failures in some hosts when mounting home directories, in
>>>> some cases more than one host at a time.
>>>
>>> This does sound a bit like a known problem.
>>> A bunch of patches have gone into RHEL-4 U8 which resolved almost all
>>> reported problems. You will need to log a bug or update to U8 to check.
>>
>> Thanks for your answer.
>>
>> Would upgrading autofs from 4.1.3-234 to 4.1.3-238 be enough, or do I
>> need to upgrade to the latest kernel as well?
>
> Above I was actually referring to the kernel, although I didn't make
> that clear. The RHEL-4.8 kernel update fixed almost all reported kernel
> related problems (there were a couple).
>
>>
>> Would using the "autofs5" package in RHEL4 be better in this sense? Do
>> you know if that package is as stable as autofs 4 is in RHEL4?
>
> That a trick question, right?
> As the maintainer I will always recommend version 5.
>
> The RHEL-4 autofs5 package is essentially the same as the RHEL-5 autofs
> package except that it's a release behind, as updates are back ported.
> The back porting of RHEL-5 updates will be somewhat more selective from
> RHEL-4.9 onward.
>
> Ian
>
> _______________________________________________
> autofs mailing list
> [email protected]
> http://linux.kernel.org/mailman/listinfo/autofs
