Hello, all,

Funny you should mention - I was just getting ready to ask about this.

We are doing the same thing, i.e. submitting jobs via LSF.  What we see are
"file not found" errors when trying to access a file somewhere down in the
tree of an automounted file system.  For instance, a job will execute a Perl
script that starts with "#!/tools/perl5.8.3/bin/perl", which fails because
it cannot find the Perl executable.  I log into the machine and do "ls
/tools/perl5.8.3/bin/perl" and get a file not found.  I check /etc/mnttab or
/proc/mounts and /tools/perl5.8.3 is not mounted.  So then I do an ls of
/tools/perl5.8.3 and the mount is made.  Once I do that, the mount point is
generally well behaved for some random period of time, after which we go
through all this again.
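
As an illustration only, the manual check above (look in /proc/mounts, then
ls the mount point) could be scripted along these lines.  This is a minimal
sketch in Python; the function names and the sample mount entry are
hypothetical, and it assumes the mount table is in the whitespace-separated
/proc/mounts format.

```python
import os


def mount_points(mounts_text):
    """Return the set of mount points found in /proc/mounts-style text.

    Each line is expected to look like:
        device mount_point fstype options dump pass
    """
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def is_mounted(path, mounts_text):
    """True if 'path' itself appears as a mount point in the given text."""
    return os.path.normpath(path) in mount_points(mounts_text)


def trigger_mount(path):
    """List the directory, as 'ls' would, to nudge the automounter.

    Returns True if the listing succeeded, False otherwise.
    """
    try:
        os.listdir(path)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    mount_point = "/tools/perl5.8.3"  # path taken from the example above
    with open("/proc/mounts") as handle:
        if not is_mounted(mount_point, handle.read()):
            print("not mounted, listing it:", trigger_mount(mount_point))
```

Run on an affected host, this would report whether the listing succeeded;
it does not explain why the original mount attempt was dropped.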

At first we thought it was networking problems because we were also seeing
some "server not responding" errors on our Solaris boxes.  We found that if
the mount failed with an RPC timeout, then the automounter would not try
again until you did an ls of the mount point directory (or in some cases,
you would have to cd to the directory to get the mount to happen).  We have
fixed some networking problems that we found and the number of these kinds
of error messages has gone way down.  Now we only see them when the 10 boxes
all run a cron job at 10PM and try to mount the same file system at the same
time.  Some win but most lose.

Testing (60 second expiry, multiple jobs accessing files every 2 to 3
minutes; caused lots of expirations and remounts) showed that we could also
lose track of a mount if the mount expired and was then immediately needed
again.  It would not actually remount, but the automounter thought it had.
As above, an ls or cd would fix the problem.

Occasionally, the automounter fails to mount without any indication that I
can find in /var/log/messages.  And, again, an ls or cd of the directory
will cause the mount to happen.
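
Since an ls of the mount point reliably brings the mount back, a pre-exec
style check (like the stat-based one mentioned in the quoted message below)
could retry along these lines.  Again this is only a hedged sketch in
Python: the function name, the retry count and the delay are assumptions,
and it only does the ls-style listing, not the cd that is sometimes needed.

```python
import os
import time


def ensure_available(path, mount_point, attempts=3, delay=1.0):
    """Try to stat 'path'; on failure, list 'mount_point' and try again.

    Returns True as soon as the stat succeeds, False if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            os.stat(path)
            return True
        except OSError:
            # Same effect as running 'ls' on the mount point by hand.
            try:
                os.listdir(mount_point)
            except OSError:
                pass
            # Give the automounter a moment before the next attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return False


if __name__ == "__main__":
    # Paths taken from the example earlier in this message.
    ok = ensure_available("/tools/perl5.8.3/bin/perl", "/tools/perl5.8.3")
    print("perl available:", ok)
```

This only papers over the symptom; it does not tell us why the automounter
missed the mount in the first place.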

Most of the machines are running Red Hat EL 3 U4 (automount 4.1.3-47,
2.4.21-27.0.1ELhugemem/smp kernel).  One is running 4.1.3-12.  A couple are
running RHEL 3 U0, 2.4.21-4EL kernel, 4.1.0-2 automount.  We have several
IBM blades with P4's and mostly 4GB of memory.  We also have one HP DL585
running AMD64 with 16GB of memory.  Most run with a 10 minute expiry, but
one is set to 30 minutes and one to 1 hour.  That does not seem to affect
the error rate.  Some are running soft mounts to the tools (which should be
read only) and some are running hard mounts - this too does not seem to make
a difference.

And, oh yes, these mounts are all from NetApp Filers.

Anybody else see this and/or have any ideas?


Pete Harris
Tektronix, Inc.
Technical Computing
MS 39-325 / PO BOX 500 / BEAVERTON OR 97077-0500
Phone:  1-503-627-3989
Fax:    1-503-627-5587
----------------------------------------------------------------------
--          Any opinions expressed are those of the author          --
--             and may not be those of Tektronix, Inc.              --

=-----Original Message-----
=From: [EMAIL PROTECTED] [mailto:autofs-
[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
=Sent: Thursday, February 03, 2005 4:39 PM
=To: [EMAIL PROTECTED]
=Cc: [email protected]
=Subject: Re: [autofs] unacceptable bug in autofs kernel module
=
=On 28 Dec, ramana wrote:
=
=> Here is the bug in autofs3 module which causing so much pain. It simply
=> stopped me from adding much more interesting features to Autodir
=> http://www.intraperson.com/autodir/
=[snip]
=> Because of this, user space test program reporting like this:
=>
=> fail : /test/t944 : No such file or directory
=> fail : /test/t4187 : No such file or directory
=
=Hmm.. I wonder if this might be related to a weirdness we're seeing.
=Running autofs-4.1.3 with previous latest patch to kernel (pre-2005
=release) and users use LSF to submit batch jobs to hosts.  On linux hosts,
=user level programs will sometimes exit quickly with a "file does not
=exist" error, even though you can login to the host and see the file/dir
=just fine.  As a hacked work-around, we have a pre-exec script that tries
=to stat all the directories they need to force the mounts to happen before
=their program touches the files.
=
=I didn't see any attempts to patch this bit.. did you have any ideas on
=how to patch that particular piece of code?   Or just comment it out?
=
=--
=Mike Marion-Unix SysAdmin/Staff Engineer-http://www.qualcomm.com
=Groundskeeper Willie: "oooh.. Me mule wouldn't walk in the mud.  So I had
=to put 17 bullets in 'em." ==> Simpsons
=
=_______________________________________________
=autofs mailing list
=[EMAIL PROTECTED]
=http://linux.kernel.org/mailman/listinfo/autofs
