On 2018-09-07 18:53, Mike Cammilleri wrote:
Hi everyone,

I'm getting this error lately for everyone's jobs, which results in memory not 
being constrained via the cgroups plugin.


slurmstepd: error: task/cgroup: unable to add task[pid=21681] to memory cg '(null)'
slurmstepd: error: jobacct_gather/cgroup: unable to instanciate user 3691 memory cgroup

The result is that no uid_ directories are created under /sys/fs/cgroup/memory.


Here is our cgroup.conf file:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/cgroup"
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=no
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0

We are using jobacct_gather/cgroup
# ACCOUNTING
JobAcctGatherType=jobacct_gather/cgroup

The partition is configured like this
PartitionName=long Nodes=marzano[05-13] PriorityTier=30 Default=NO MaxTime=5-0 State=UP OverSubscribe=FORCE:1

We are using slurm 16.05.6 on Ubuntu 14.04 LTS

Any ideas how to get cgroups going again?


This appears to be a bug in the Linux kernel: it doesn't garbage collect deleted memory cgroups. Eventually the kernel hits an internal limit on how many memory cgroups can exist, and refuses to create more.
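
A rough way to check whether a node is heading into this state is to watch the num_cgroups column for the memory controller in /proc/cgroups. As far as I understand it, on affected kernels that count keeps growing even after the directories under /sys/fs/cgroup/memory have been removed, but treat that as an assumption to verify on your own nodes. A minimal Python sketch (the function name is made up for illustration):

```python
def memory_cgroup_count(proc_cgroups_text):
    """Return num_cgroups for the memory controller, or None if it is not listed.

    Expects the text of /proc/cgroups, whose columns are:
    subsys_name  hierarchy  num_cgroups  enabled
    """
    for line in proc_cgroups_text.splitlines():
        # Skip the header line (starts with '#') and blank lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "memory":
            return int(fields[2])
    return None
```

On a compute node you could call it as memory_cgroup_count(open("/proc/cgroups").read()) and compare the result with the number of directories actually present under /sys/fs/cgroup/memory.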

The bug has reportedly been fixed in the upstream kernel, but it is still present in at least the CentOS 7 kernel and, based on your report, in the Ubuntu 14.04 kernel.

One workaround is to reboot the node whenever this happens. Another is to set ConstrainKmemSpace=no in cgroup.conf (but AFAICS this option was added in Slurm 17.02 and is not present in the 16.05 release that you're using).
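
For illustration, on a Slurm version that has the option, the relevant part of cgroup.conf might look like the fragment below. The first three lines are copied from the configuration quoted above; only the last line is new, and it would only take effect if your slurmd actually supports it.

```
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0
ConstrainKmemSpace=no
```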

For more information, see discussion and links in slurm bug #5082.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
