Hi Shawn,

I'm wondering if you're still seeing this. I've recently enabled task/cgroup on 17.11.5 running on CentOS 7 and just discovered that jobs are escaping their cgroups. For me this results in a lot of jobs ending in OUT_OF_MEMORY that shouldn't, apparently because slurmd thinks the oom-killer has triggered when it hasn't.
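In case it's useful to anyone else chasing this, below is the rough check I've been using to see whether a job's processes are really where slurmd thinks they are. It's only a sketch and makes assumptions not established in this thread: cgroup v1 as mounted on CentOS 7, and PIDs fed in from something like `scontrol listpids <jobid>`:

#!/usr/bin/env python
# Rough escape check: print the memory and devices cgroup each PID is
# actually in, according to /proc/<pid>/cgroup. Assumes cgroup v1.
# Usage: ./check_escape.py $(scontrol listpids <jobid> | awk 'NR>1 {print $1}')
import sys

def cgroups_of(pid):
    # /proc/<pid>/cgroup lines look like "4:memory:/slurm/uid_1000/job_12345"
    out = {}
    with open("/proc/%s/cgroup" % pid) as f:
        for line in f:
            _, ctrls, path = line.rstrip("\n").split(":", 2)
            for c in ctrls.split(","):
                out[c] = path
    return out

for pid in sys.argv[1:]:
    cg = cgroups_of(pid)
    mem = cg.get("memory", "?")
    # a tracked task sitting outside the .../slurm/... hierarchy has escaped
    flag = "  <-- ESCAPED" if "/slurm/" not in mem else ""
    print("%s memory=%s devices=%s%s" % (pid, mem, cg.get("devices", "?"), flag))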
I'm not using GRES or devices, only:

cgroup.conf:

CgroupAutomount=yes
ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

slurm.conf:

JobAcctGatherType=jobacct_gather/cgroup
JobAcctGatherFrequency=task=15
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

The only log messages that seem to correspond are:

[JOB_ID.batch] debug: Handling REQUEST_STATE
debug: _fill_registration_msg: found apparently running job JOB_ID

Thanks,
--nate

On Mon, Apr 23, 2018 at 4:41 PM, Kevin Manalo <kman...@jhu.edu> wrote:
> Shawn,
>
> Just to give you a compare and contrast, the related entries in our
> slurm.conf are:
>
> JobAcctGatherType=jobacct_gather/linux  # will migrate to cgroup eventually
> JobAcctGatherFrequency=30
> ProctrackType=proctrack/cgroup
> TaskPlugin=task/affinity,task/cgroup
>
> cgroup_allowed_devices_file.conf:
>
> /dev/null
> /dev/urandom
> /dev/zero
> /dev/sda*
> /dev/cpu/*/*
> /dev/pts/*
> /dev/nvidia*
>
> gres.conf (four K80s on a 24-core Haswell node):
>
> Name=gpu File=/dev/nvidia0 CPUs=0-5
> Name=gpu File=/dev/nvidia1 CPUs=12-17
> Name=gpu File=/dev/nvidia2 CPUs=6-11
> Name=gpu File=/dev/nvidia3 CPUs=18-23
>
> I also looked at multi-tenant jobs on our MARCC cluster that have run for
> more than a day, and they are still inside their cgroups, but again this
> is on CentOS 6 clusters.
>
> Are you still seeing cgroup escapes now, specifically for jobs running
> more than a day?
>
> Thanks,
> Kevin
>
> From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of
> Shawn Bobbin <sabob...@umiacs.umd.edu>
> Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Date: Monday, April 23, 2018 at 2:45 PM
> To: Slurm User Community List <slurm-users@lists.schedmd.com>
> Subject: Re: [slurm-users] Jobs escaping cgroup device controls after
> some amount of time.
>
> Hi,
>
> I attached our cgroup.conf and gres.conf.
>
> As for the cgroup_allowed_devices.conf file, I have this file stubbed but
> empty. In 17.02, Slurm started fine without this file (as far as I
> remember), and its being empty doesn't appear to actually impact anything;
> device availability remains the same. Based on the behavior explained in
> [0], I don't expect this file to impact containment of specific GPUs.
>
> TaskPlugin = task/cgroup
> ProctrackType = proctrack/cgroup
> JobAcctGatherType = jobacct_gather/cgroup
>
> [0] https://bugs.schedmd.com/show_bug.cgi?id=4122
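P.S. Since the original report was about device controls specifically: a quick way to see whether a job's device whitelist is still in force is to read devices.list under the job's cgroup. Another rough sketch with the same cgroup v1 assumptions as above; the slurm/uid_<uid>/job_<jobid> layout under /sys/fs/cgroup/devices is how task/cgroup lays things out here, but double-check your mount points:

#!/usr/bin/env python
# Dump devices.list for a job's devices cgroup and all of its steps.
# An "a *:* rwm" rule means all devices are allowed, i.e. containment
# is effectively gone.
# Usage: ./check_devices.py <jobid> <uid>
import os, sys

jobid, uid = sys.argv[1], sys.argv[2]
job_dir = "/sys/fs/cgroup/devices/slurm/uid_%s/job_%s" % (uid, jobid)

for dirpath, _, files in os.walk(job_dir):
    if "devices.list" not in files:
        continue
    with open(os.path.join(dirpath, "devices.list")) as f:
        rules = [line.strip() for line in f if line.strip()]
    flag = "  <-- unconstrained" if "a *:* rwm" in rules else ""
    print("%s: %s%s" % (dirpath, "; ".join(rules), flag))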