Here's the sequence of events:

1. First job(s) run fine on the node and complete without error.

2. Eventually a job fails with a 'permission denied' error when it tries to access /l/hostname.

Since no jobs fail with a file I/O error, it's hard to confirm that the jobs themselves are causing the problem. However, if these particular jobs are the only thing running on the cluster and should be the only processes accessing these NFS shares, what else could be causing it?

All these systems get their user information from LDAP. Since some jobs run successfully before these errors appear, missing or inaccurate user information doesn't seem to be a likely source of the problem, but I'm not ruling anything out at this point.
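One way to sanity-check the LDAP theory is to compare the numeric IDs a client and the NFS server resolve for a job account. A minimal sketch (the `user=root` assignment is just so the snippet runs anywhere; substitute an LDAP account that actually runs these jobs):

```shell
# Verify that a node resolves consistent numeric IDs for a user.
# "root" is a stand-in here; substitute a real LDAP job account.
user=root
getent passwd "$user"   # should come back via nsswitch (files/ldap)
id "$user"              # numeric uid/gid the client sends over the wire
# Repeat both commands on the NFS server: with NFSv3 the numeric IDs must
# match exactly, because the protocol carries UIDs/GIDs, not names.
```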

Important detail: This is NFSv3.
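To confirm which protocol version the automounts actually negotiated (rather than what the config requests), something like the following works; both commands are guarded so the sketch runs even on a machine with no NFS mounts:

```shell
# Show which NFS version each mount negotiated.
nfsstat -m 2>/dev/null || true     # per-mount options, e.g. vers=3
grep ' nfs' /proc/mounts || true   # mount options straight from the kernel
```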

Prentice Bisbal
Lead Software Engineer
Princeton Plasma Physics Laboratory
http://www.pppl.gov

On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
Are you saying they can’t mount the filesystem, or they can’t write to a 
mounted filesystem? Where does this system get its user information from, if 
the latter?

--
____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  |         Ryan Novosielski - [email protected]
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
      `'

On Apr 19, 2017, at 12:09, Prentice Bisbal <[email protected]> wrote:

Beowulfers,

I've been trying to troubleshoot a problem for the past two weeks with no luck. 
We have a cluster here that runs only one application (although the details of 
that application change significantly from run to run). Each node in the 
cluster has an NFS export, /local, that can be automounted by every other node 
in the cluster as /l/hostname.
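For readers unfamiliar with this layout: a per-host automount like /l/hostname is typically produced by an autofs wildcard map, roughly like the sketch below (file names and options are assumptions, not the site's actual config):

```
# /etc/auto.master:
/l    /etc/auto.l

# /etc/auto.l -- wildcard map: the lookup key is the hostname,
# and "&" substitutes that key into the mount source:
*     -fstype=nfs,vers=3    &:/local
```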

Starting about two weeks ago, when jobs would try to access /l/hostname, they 
would get permission denied messages. I tried analyzing this problem by turning 
on all NFS/RPC logging with rpcdebug and also using tcpdump while trying to 
manually mount one of the remote systems. Both approaches indicated stale file 
handles were preventing the share from being mounted.
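For reference, the logging/capture setup described above looks roughly like this on a client (run as root; the interface name and capture path are placeholders, and every command is guarded so the sketch degrades gracefully on a machine without NFS):

```shell
# Turn on verbose kernel logging for the NFS client and SunRPC layers;
# messages land in dmesg/syslog.
rpcdebug -m nfs -s all || true
rpcdebug -m rpc -s all || true
# Capture the mount attempt on the wire: 2049 is NFS, 111 is rpcbind.
tcpdump -c 200 -i eth0 -w /tmp/nfs-mount.pcap port 2049 or port 111 || true
# ...reproduce the failing mount, then turn logging back off:
rpcdebug -m nfs -c all || true
rpcdebug -m rpc -c all || true
```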

Since it has been 6-8 weeks since there were any seemingly relevant system 
config changes, I suspect it's an application problem (naturally). On the other 
hand, the application developers/users insist that they haven't made any 
changes to their code, either. To be honest, there's no significant evidence 
indicating either is at fault. Any suggestions on how to debug this and 
definitively find the root cause of these stale file handles?
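Two quick checks along those lines, sketched below: a handle goes stale when the inode (or fsid) behind it changes on the server, so it's worth recording the exported directory's identity across the failure window, and seeing whether a client recovers after dropping its cached automount. Paths are the cluster's own (/local, /l/hostname); the commands are guarded so the sketch runs anywhere:

```shell
# 1) On the exporting node: has the exported directory's identity changed?
#    A stale handle usually means the underlying inode or fsid changed.
stat -c 'dev=%d ino=%i' /local 2>/dev/null || true
# 2) On a client: drop the cached automount and force a fresh lookup.
umount -l /l/hostname 2>/dev/null || true
ls /l/hostname 2>/dev/null || true   # autofs should remount cleanly if only
                                     # the cached handle was stale
```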

--
Prentice
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf

