This sounds a lot like the mount command not retrying a mount when an RPC times out. Networking problems or an overloaded mountd on the server would both be reasons for an RPC timeout during a mount.
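Until mount itself retries, a pre-exec script like the one Mike describes further down the thread could paper over a timed-out mount by touching the mount point a few times before the job starts. What follows is only a rough sketch of that idea, not anything taken from autofs or from the mount patch: the function name `force_mount`, the retry count, and the delay are all made up for illustration. Going by Pete's report, the path passed in should probably be the mount point directory itself rather than a file below it.

```python
import os
import time


def force_mount(path, attempts=3, delay=2.0, sleep=time.sleep):
    """Stat `path` a few times so the automounter gets a chance to mount it.

    Hypothetical helper: name, defaults, and retry policy are assumptions.
    Returns True as soon as the path resolves, False if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            os.stat(path)  # accessing the path is what triggers the automount
            return True
        except OSError:
            # Covers "No such file or directory" as well as I/O errors.
            if attempt < attempts:
                sleep(delay)  # give a slow mountd or network time to recover
    return False


# Example use from a job wrapper (the path is taken from the report below):
#     if not force_mount("/tools/perl5.8.3"):
#         raise SystemExit("could not mount /tools/perl5.8.3")
```

This only works around the symptom on the client; it does not explain why the first mount attempt is not retried.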
Jeff, is the mount patch we worked on last summer available for RHEL 3, or is it just a RHEL AS 2.1 fix at this point?

Peter, what release of Data ONTAP is running on the filer(s)?

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, February 04, 2005 1:58 PM
> To: [email protected]
> Cc: [EMAIL PROTECTED]
> Subject: RE: [autofs] unacceptable bug in autofs kernel module
>
> Hello, all,
>
> Funny you should mention - I was just getting ready to ask about this.
>
> We are doing the same thing, i.e. submitting jobs via LSF.
> What we see are file not found errors when trying to access a
> file somewhere down in the tree of an automounted file
> system. For instance, a job will execute a Perl script that
> starts with "#!/tools/perl5.8.3/bin/perl", which fails
> because it cannot find the Perl executable. I log into the
> machine and do "ls /tools/perl5.8.3/bin/perl" and get a file
> not found. I check /etc/mnttab or /proc/mounts and
> /tools/perl5.8.3 is not mounted. So then I do an ls of
> /tools/perl5.8.3 and the mount is made. Once I do that, the
> mount point is generally well behaved for some random period
> of time when we will go through all this again.
>
> At first we thought it was networking problems because we
> were also seeing some "server not responding" errors on our
> Solaris boxes. We found that if the mount failed with an RPC
> timeout, then the automounter would not try again until you
> did an ls of the mount point directory (or in some cases, you
> would have to cd to the directory to get the mount to
> happen). We have fixed some networking problems that we
> found and the number of these kinds of error messages has
> gone way down. Now we only see them when the 10 boxes all
> run a cron job at 10PM and try to mount the same file system
> at the same time. Some win but most lose.
>
> Testing (60 second expiry, multiple jobs accessing files
> every 2 to 3 minutes; caused lots of expirations and
> remounts) showed that we could also lose track of a mount if
> the mount expired and then immediately remounted.
> Well, it would not remount but the automounter thought it
> had. Similarly to the above, an ls or cd would fix the problem.
>
> Occasionally, the automounter fails to mount without any
> indication that I can find in /var/log/messages. And, again,
> an ls or cd of the directory will cause the mount to happen.
>
> Most of the machines are running Red Hat EL 3 U4 (automount
> 4.1.3-47, 2.4.21-27.0.1ELhugemem/smp kernel). One is running
> 4.1.3-12. A couple are running RHEL 3 U0, 2.4.21-4EL kernel,
> 4.1.0-2 automount. We have several IBM blades with P4's and
> mostly 4GB of memory. We also have one HP DL585 running
> AMD64 with 16GB of memory. Most run with a 10 minute expiry,
> but one is set to 30 minutes and one to 1 hour. That does
> not seem to affect the error rate. Some are running soft
> mounts to the tools (which should be read only) and some are
> running hard mounts - this too does not seem to make a difference.
>
> And, oh yes, these mounts are all from NetApp Filers.
>
> Anybody else see this and/or have any ideas?
>
>
> Pete Harris
> Tektronix, Inc.
> Technical Computing
> MS 39-325 / PO BOX 500 / BEAVERTON OR 97077-0500
> Phone: 1-503-627-3989
> Fax: 1-503-627-5587
> ----------------------------------------------------------------------
> -- Any opinions expressed are those of the author --
> -- and may not be those of Tektronix, Inc. --
>
> =-----Original Message-----
> =From: [EMAIL PROTECTED] [mailto:autofs-[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
> =Sent: Thursday, February 03, 2005 4:39 PM
> =To: [EMAIL PROTECTED]
> =Cc: [email protected]
> =Subject: Re: [autofs] unacceptable bug in autofs kernel module
> =
> =On 28 Dec, ramana wrote:
> =
> => Here is the bug in autofs3 module which causing so much pain.
> => It simply stopped me from adding much more interesting features
> => to Autodir
> => http://www.intraperson.com/autodir/
> =[snip]
> => Because of this, user space test program reporting like this:
> =>
> => fail : /test/t944 : No such file or directory
> => fail : /test/t4187 : No such file or directory
> =
> =Hmm.. I wonder if this might be related to a weirdness we're seeing.
> =Running autofs-4.1.3 with previous latest patch to kernel (pre-2005
> =release) and users use LSF to submit batch jobs to hosts. On linux
> =hosts, user level programs will sometimes exit quickly with a "file
> =does not exist" error, even though you can login to the host and see
> =the file/dir just fine. As a hacked work-around, we have a pre-exec
> =script that tries to stat all the directories they need to force the
> =mounts to happen before their program touches the files.
> =
> =I didn't see any attempts to patch this bit.. did you have any ideas
> =on how to patch that particular piece of code? Or just comment it out?
> =
> =--
> =Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
> =Groundskeeper Willie: "oooh.. Me mule wouldn't walk in the mud. So I
> =had to put 17 bullets in 'em." ==> Simpsons
> =
> =_______________________________________________
> =autofs mailing list
> =[EMAIL PROTECTED]
> =http://linux.kernel.org/mailman/listinfo/autofs
