No, I have never seen anything similar. A small bit of help: the 'nfswatch' utility is useful for tracking down NFS problems. Less relevant, but on a system which is running low on memory, 'watch cat /proc/meminfo' is often good for shining a light.
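A minimal sketch of those quick checks, assuming a Linux host (field names are the standard /proc/meminfo ones; the D-state filter is the usual signature of a task stuck in NFS I/O):

```shell
#!/bin/sh
# Memory pressure at a glance: the fields most relevant when a job is
# bumping against a cgroup memory limit.
grep -E '^(MemTotal|MemFree|MemAvailable|Dirty|Writeback):' /proc/meminfo

# Tasks in uninterruptible sleep (state 'D'); wchan shows the kernel
# function each one is blocked in (nfs_*/rpc_* points at NFS I/O).
ps -eo pid,stat,wchan:30,comm | awk 'NR == 1 || $2 ~ /^D/'
```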
On 2 September 2017 at 00:16, Brendan Moloney <moloney.bren...@gmail.com> wrote:
> Hello,
>
> I am using cgroups to track processes and limit memory. Occasionally it
> seems like a job will use too much memory, and instead of getting killed it
> ends up in an unkillable state waiting for NFS I/O. There are no other
> signs of NFS issues; in fact, other jobs (even on the same node) seem to
> be having no problem communicating with the same NFS server at that same
> time. I just get hung task errors for that one specific process (the one
> that used too much memory).
>
> Has anyone else run into this? Searching this mailing list archive I found
> some similar reports, but those seemed to concern installing Slurm itself
> onto an NFS4 mount rather than jobs merely using an NFS4 mount.
>
> Any advice is greatly appreciated.
>
> Thanks,
> Brendan
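One more thing worth capturing next time it happens: the kernel's view of where the hung task is blocked. A sketch (the PID argument is hypothetical; it defaults to the current shell just so the commands run):

```shell
#!/bin/sh
# PID of the suspect process; defaults to this shell for demonstration.
PID=${1:-$$}

# Process state -- 'D (disk sleep)' confirms uninterruptible sleep.
grep '^State:' "/proc/$PID/status"

# Kernel stack of the task (usually root-only); a task hung on NFS will
# typically show nfs_* or rpc_wait_bit_killable frames here.
cat "/proc/$PID/stack" 2>/dev/null || echo "stack unreadable (need root?)"
```

Pasting that stack into a report usually settles whether the task is really stuck in the NFS client or somewhere else entirely.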