This is interesting information, since we now use cgroups on all of our clusters. Large reads/writes with caching on NFS may count toward a job's memory usage under cgroups, and can get a job killed when the user does not expect it.
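The accounting behavior described here can be seen directly in a cgroup v1 memory controller's memory.stat file, where page cache is reported separately from anonymous memory but both are charged against the limit. The following is a minimal sketch that parses an illustrative memory.stat excerpt (the byte values are made up for the example; on a real node you would read /sys/fs/cgroup/memory/slurm/.../memory.stat):

```shell
# Hypothetical memory.stat contents (values in bytes, invented for
# illustration). Under cgroup v1, both "cache" and "rss" are charged
# to the cgroup, so page cache from large NFS reads counts too.
stat='cache 7516192768
rss 1073741824
mapped_file 524288'

cache=$(printf '%s\n' "$stat" | awk '/^cache/ {print $2}')
rss=$(printf '%s\n' "$stat" | awk '/^rss/ {print $2}')

# Convert to MiB for readability.
echo "cache: $((cache / 1024 / 1024)) MiB, rss: $((rss / 1024 / 1024)) MiB"
```

With numbers like these, a job whose real working set is about 1 GiB would still show roughly 8 GiB charged to its cgroup, which is how an I/O-heavy job can look far larger than the user expects.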
--
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski - novos...@rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Jan 22, 2016, at 05:33, Felip Moll <lip...@gmail.com> wrote:
>
> Finally I solved the issue, in large part thanks to Carlos Fenoy's tips.
>
> The issue was due to the NFS filesystem. This filesystem, as CF said, caches
> data while other filesystems do not. Cgroups take cached data into account,
> and our users' jobs use the NFS filesystem intensively.
>
> I switched from:
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/cgroup
> TaskPluginParam=
>
> To:
> ProctrackType=proctrack/linuxproc
> TaskPlugin=task/affinity
> TaskPluginParam=Sched
>
> In the following 11 days I didn't receive a single OOM kill and everything
> is working perfectly.
>
> Best regards and thanks to all of you.
> Felip M
>
> --
> Felip Moll Marquès
> Computer Science Engineer
> E-Mail - lip...@gmail.com
> WebPage - http://lipix.ciutadella.es
>
> 2015-12-18 15:09 GMT+01:00 Bjørn-Helge Mevik <b.h.me...@usit.uio.no>:
>
> Carlos Fenoy <mini...@gmail.com> writes:
>
> > Barbara, I don't think that is the issue here. The killer is the OOM
> > killer, not Slurm, so Slurm is not accounting the amount of memory
> > incorrectly. Rather, it seems that cached memory is also accounted in
> > the cgroup, and that is what is causing the OOM killer to kill gzip.
>
> I've seen cases where the job has copied a set of large files, which
> makes the cgroup memory usage go right up to the limit. I guess that is
> cached data. Then the job starts computing without getting killed. My
> interpretation is that the kernel will flush the cache when a process
> needs more memory, instead of killing the process. If I'm correct, the
> OOM killer will _not_ kill a job due to cached data.
>
> --
> Regards,
> Bjørn-Helge Mevik, dr. scient,
> Department for Research Computing, University of Oslo
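Bjørn-Helge's point about reclaim can be illustrated with a toy arithmetic sketch. All numbers below are invented for the example, not measured data: the idea is that clean page cache charged to the cgroup is reclaimable, so a job sitting near its limit because of cache alone is not necessarily OOM-killed when it allocates more memory.

```shell
# Toy model (made-up MiB figures): a job whose cgroup is nearly full
# of clean page cache allocates more anonymous memory. The kernel
# reclaims cache first rather than invoking the OOM killer.
limit=$((8 * 1024))   # cgroup memory limit, MiB
rss=1024              # anonymous memory the job actually needs, MiB
cache=7000            # clean page cache from large NFS reads, MiB
usage=$((rss + cache))
echo "usage before reclaim: ${usage} MiB (limit ${limit} MiB)"

# The job now allocates 2048 MiB more. Reclaim just enough cache to
# stay under the limit; only if cache ran out would the OOM killer act.
need=2048
reclaim=$(( usage + need > limit ? usage + need - limit : 0 ))
cache=$(( cache - reclaim ))
rss=$(( rss + need ))
echo "after reclaim: rss ${rss} MiB, cache ${cache} MiB"
```

In this sketch the allocation succeeds with rss + cache exactly at the limit, matching the observation that a cgroup full of cached data does not by itself trigger an OOM kill.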