Hello, we encountered something similar at our site. From what I remember it was a problem with the NFS client running out of memory for buffer space; the application would stall in 'disk wait', and the OOM killer couldn't kill it. In our case the solution was to reserve more memory for the kernel.
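
I no longer have the exact settings we used, but it was along the lines of raising the kernel's free-memory reserve via sysctl. A rough sketch only - the value below is a placeholder you would need to tune for your nodes:

    # Reserve more free memory for the kernel (value in kB); the
    # default is auto-sized and can be too small under heavy NFS load.
    sysctl -w vm.min_free_kbytes=262144

    # To persist across reboots, put "vm.min_free_kbytes = 262144" in
    # /etc/sysctl.conf (or a file under /etc/sysctl.d/) and reload:
    sysctl -p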
best regards
Maciej Pawlik

2017-09-03 14:20 GMT+02:00 John Hearns <hear...@googlemail.com>:

> No, have never seen anything similar.
> A small bit of help - the 'nfswatch' utility is useful for tracking down
> NFS problems.
> Less relevant, but on a system which is running low on memory 'watch cat
> /proc/meminfo' is often good for shining a light.
>
>
> On 2 September 2017 at 00:16, Brendan Moloney <moloney.bren...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I am using cgroups to track processes and limit memory. Occasionally it
>> seems like a job will use too much memory, and instead of getting killed
>> it ends up in an unkillable state waiting for NFS I/O. There are no other
>> signs of NFS issues, and in fact other jobs (even on the same node) seem
>> to be having no problem communicating with the same NFS server at that
>> same time. I just get hung task errors for that one specific process
>> (the one that used too much memory).
>>
>> Has anyone else run into this? Searching this mailing list archive I
>> found some similar stuff, but that seemed to be in regards to installing
>> Slurm itself onto an NFS4 mount rather than just having jobs use an NFS4
>> mount.
>>
>> Any advice is greatly appreciated.
>>
>> Thanks,
>> Brendan
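
P.S. On the original question: a quick way to confirm the job really is stuck in uninterruptible NFS wait is to look for tasks in state 'D' and check what they are blocked on. A rough sketch (run as root; <pid> is a placeholder):

    # List tasks in uninterruptible sleep (state D) together with the
    # kernel function they are blocked in; NFS waits usually show up
    # as nfs_* or rpc_* in the WCHAN column.
    ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

    # Kernel stack of one stuck task, to see the full wait path:
    cat /proc/<pid>/stack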