Hi,

I believe you can get that message ('Exceeded job memory limit at some point')
even if the job finishes fine.  When SLURM creates the cgroup it writes the
job's memory request into memory.limit_in_bytes.  During the life of the job
the kernel updates a number of files within the cgroup, one of which is
memory.usage_in_bytes - the cgroup's current memory usage.  Periodically,
SLURM checks whether the cgroup has exceeded its limit (i.e.
memory.limit_in_bytes) - the frequency of the check is probably set by
JobAcctGatherFrequency.  It does this by checking whether memory.failcnt is
greater than zero.  memory.failcnt is incremented by the kernel each time
memory.usage_in_bytes reaches the value set in memory.limit_in_bytes.
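
If you want to see the mechanism for yourself, those counters are just files
under the job's memory cgroup.  Here is a minimal sketch (not part of SLURM;
the example path is an assumption based on the usual
/sys/fs/cgroup/memory/slurm/uid_<uid>/job_<jobid> layout, which can differ
between sites) that prints the three values for a given cgroup directory:

#include <stdio.h>

/* Print one counter file from the given cgroup directory. */
static void print_param(const char *dir, const char *param)
{
    char path[512];
    char buf[64];
    FILE *fp;

    snprintf(path, sizeof(path), "%s/%s", dir, param);
    fp = fopen(path, "r");
    if (!fp) {
        perror(path);
        return;
    }
    if (fgets(buf, sizeof(buf), fp))
        printf("%-24s %s", param, buf);
    fclose(fp);
}

int main(int argc, char **argv)
{
    /* e.g. ./a.out /sys/fs/cgroup/memory/slurm/uid_1000/job_41
     * (hypothetical path - check how your site mounts the cgroups) */
    if (argc < 2) {
        fprintf(stderr, "usage: %s <cgroup-dir>\n", argv[0]);
        return 1;
    }
    print_param(argv[1], "memory.limit_in_bytes");
    print_param(argv[1], "memory.usage_in_bytes");
    print_param(argv[1], "memory.failcnt");
    return 0;
}

If memory.failcnt is non-zero you will get the message, even though the job
may still run to completion.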

This is the code snippet that produces the error (found in task_cgroup_memory.c) 
…
extern int task_cgroup_memory_check_oom(stepd_step_rec_t *job)
{
...
            else if (failcnt_non_zero(&step_memory_cg,
                          "memory.failcnt"))
                /* reports the number of times that the
                 * memory limit has reached the value set
                 * in memory.limit_in_bytes.
                 */
                error("Exceeded step memory limit at some point.");
...
            else if (failcnt_non_zero(&job_memory_cg,
                          "memory.failcnt"))
                error("Exceeded job memory limit at some point.");
...
}
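
For reference, failcnt_non_zero (not shown in the excerpt above) just reads
the named counter from the step or job cgroup and returns whether it is
greater than zero.  A simplified standalone sketch of that check - not the
verbatim SLURM source, which uses SLURM's internal cgroup API rather than
reading the file directly - would be roughly:

#include <stdio.h>
#include <inttypes.h>

/* Return 1 if the given cgroup counter (e.g. "memory.failcnt") is > 0. */
static int failcnt_non_zero(const char *cgroup_dir, const char *param)
{
    char path[512];
    uint64_t value = 0;
    FILE *fp;

    snprintf(path, sizeof(path), "%s/%s", cgroup_dir, param);
    fp = fopen(path, "r");
    if (!fp)
        return 0;    /* counter not readable: report no failures */
    if (fscanf(fp, "%" SCNu64, &value) != 1)
        value = 0;
    fclose(fp);

    return value > 0;
}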

Anyway, back to the point.  You can see this message without the job failing
because the kernel counter (memory.failcnt) that SLURM checks doesn't actually
mean the memory limit has been exceeded - it means the memory limit has been
reached, a subtle but important difference.  It matters because the kernel
doesn't OOM-kill a job merely for reaching its limit; it first tries to
reclaim memory charged to the cgroup (clean page cache, for example, which is
exactly what a large 'cp' fills), and only kills the job if usage still can't
be brought back under the limit.  So memory.failcnt can tick over while the
job carries on and finishes normally.  Note: other cgroup files like
memory.memsw.xxx also come into play if you are using swap space.

As to how to manage this: you can stop using the cgroup plugin and use an
alternative plugin, you can try the JobAcctGatherParams parameter
NoOverMemoryKill (the documentation says to use this with caution, see
https://slurm.schedmd.com/slurm.conf.html), or you can try to account for the
cache by using jobacct_gather/cgroup.  Unfortunately, because of a bug that
plugin doesn't report cache usage either.  I've submitted a bug report and a
fix to address this (https://bugs.schedmd.com/show_bug.cgi?id=3531).
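
For example, the NoOverMemoryKill route would look something like this in
slurm.conf (illustrative sketch only - the other values simply mirror the
settings already quoted in this thread, and the documentation's caution
about NoOverMemoryKill still applies):

# slurm.conf (excerpt, illustrative)
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=30
# Comma-separated list; NoShared was already in use in this thread
JobAcctGatherParams=NoShared,NoOverMemoryKill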

---
Samuel Gallop
Computing infrastructure for Science
CiS Support & Development

From: Wensheng Deng [mailto:w...@nyu.edu]
Sent: 17 March 2017 13:42
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: Slurm & CGROUP

The file is copied fine. It is just that the error message is annoying.



On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist <janne.blomqv...@aalto.fi> wrote:
On 2017-03-15 17:52, Wensheng Deng wrote:
> No, it does not help:
>
> $ scontrol show config |grep -i jobacct
>
> JobAcctGatherFrequency  = 30
>
> JobAcctGatherType       = jobacct_gather/cgroup
>
> JobAcctGatherParams     = NoShared
>
>
>
>
>
> On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu> wrote:
>
>     I think I tried that. let me try it again. Thank you!
>
>     On Wed, Mar 15, 2017 at 11:43 AM, Chris Read <cr...@drw.com> wrote:
>
>
>         We explicitly exclude shared usage from our measurement:
>
>
>         JobAcctGatherType=jobacct_gather/cgroup
>         JobAcctGatherParams=NoShared
>
>         Chris
>
>
>         ________________________________
>         From: Wensheng Deng <w...@nyu.edu>
>         Sent: 15 March 2017 10:28
>         To: slurm-dev
>         Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
>
>         It should be (sorry):
>         we 'cp'ed a 5GB file from scratch to node local disk
>
>
>         On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote:
>         Hello experts:
>
>         We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
>         5GB job from scratch to node local disk, declared 5 GB memory
>         for the job, and saw error message as below although the file
>         was copied okay:
>
>         slurmstepd: error: Exceeded job memory limit at some point.
>
>         srun: error: [nodenameXXX]: task 0: Out Of Memory
>
>         srun: Terminating job step 41.0
>
>         slurmstepd: error: Exceeded job memory limit at some point.
>
>
>         From the cgroup document
>         https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
>         Features:
>         - accounting anonymous pages, file caches, swap caches usage and
>         limiting them.
>
>         It seems that cgroup charges "RSS + file caches" to user
>         processes like 'cp' - in our case, charged to the user's jobs.
>         Swap is off in this case. The file cache can be small or very
>         big, and it should not be charged to users' batch jobs in my
>         opinion.
>         How do other sites circumvent this issue? The Slurm version is
>         16.05.4.
>
>         Thank you and Best Regards.
>
>
>
>
Could you set AllowedRAMSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf to
some big number?  That way the job memory limit becomes the cgroup soft limit,
and the cgroup hard limit - the point at which the kernel will OOM-kill the
job - would be "job_memory_limit * AllowedRAMSpace", i.e. some large value.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
