Hello,

We encountered something similar at our site. From what I remember it was a
problem with the NFS client running out of memory for buffer space: the
application would stall in 'disk wait' and the OOM killer couldn't kill it.
In our case the solution was to reserve more memory for the kernel.
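
For reference, a minimal sketch of that kind of reservation, assuming the
usual vm.min_free_kbytes knob (I don't have our exact values at hand, so
treat the number as illustrative):

  # check the current kernel memory reserve
  sysctl vm.min_free_kbytes

  # raise the reserve, e.g. to 1 GB; persist it in /etc/sysctl.d/ once it
  # proves helpful
  sysctl -w vm.min_free_kbytes=1048576

The point is simply to leave the kernel enough headroom for network and
filesystem buffers so the NFS client doesn't end up stuck waiting on memory
it can't get.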

best regards
Maciej Pawlik

2017-09-03 14:20 GMT+02:00 John Hearns <hear...@googlemail.com>:

> No, I have never seen anything similar.
> A small bit of help: the 'nfswatch' utility is useful for tracking down
> NFS problems.
> Less relevant, but on a system that is running low on memory, 'watch cat
> /proc/meminfo' is often good for shining a light.
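
That meminfo tip is worth seconding; for this particular kind of stall the
interesting fields are the dirty/writeback and NFS counters, e.g. something
along these lines (NFS_Unstable is present on kernels of that era, newer
ones may not expose it):

  # refresh every 2 seconds, showing only the memory counters that matter
  # for stalled NFS writeback
  watch -n 2 "grep -E 'MemFree|Dirty|Writeback|NFS_Unstable' /proc/meminfo"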
>
>
> On 2 September 2017 at 00:16, Brendan Moloney <moloney.bren...@gmail.com>
> wrote:
>
>> Hello,
>>
>> I am using cgroups to track processes and limit memory. Occasionally it
>> seems like a job will use too much memory and, instead of getting killed,
>> it ends up in an unkillable state waiting for NFS I/O.  There are no other
>> signs of NFS issues, and in fact other jobs (even on the same node) seem to
>> be having no problem communicating with the same NFS server at that same
>> time.  I just get hung task errors for that one specific process (the one
>> that used too much memory).
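
A quick way to confirm that state from /proc, for anyone hitting the same
thing (<pid> is a placeholder for the hung job's PID):

  # 'D' in the State line means uninterruptible sleep, i.e. disk wait
  grep State: /proc/<pid>/status

  # the kernel function the task is blocked in; an nfs/rpc symbol points
  # at NFS I/O
  cat /proc/<pid>/wchan; echo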
>>
>> Has anyone else run into this? Searching this mailing list archive I
>> found some similar reports, but those seemed to be about installing
>> Slurm itself onto an NFS4 mount rather than just having jobs use an NFS4
>> mount.
>>
>> Any advice is greatly appreciated.
>>
>> Thanks,
>> Brendan
>>
>
>
