Shawn,

Just to give you something to compare and contrast with:

The related entries we have in slurm.conf:

JobAcctGatherType=jobacct_gather/linux # will migrate to cgroup eventually
JobAcctGatherFrequency=30
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup

cgroup_allowed_devices_file.conf:

/dev/null
/dev/urandom
/dev/zero
/dev/sda*
/dev/cpu/*/*
/dev/pts/*
/dev/nvidia*

gres.conf (4 K80s on a node with a 24-core Haswell):

Name=gpu File=/dev/nvidia0 CPUs=0-5
Name=gpu File=/dev/nvidia1 CPUs=12-17
Name=gpu File=/dev/nvidia2 CPUs=6-11
Name=gpu File=/dev/nvidia3 CPUs=18-23
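
As far as I understand, those files together determine what ends up in each job's devices cgroup, so if it helps for comparing, here is a rough sketch of how one can dump what a job's devices cgroup actually allows on a node. It assumes cgroup v1 with the devices controller mounted at /sys/fs/cgroup/devices and Slurm's usual slurm/uid_<uid>/job_<jobid> layout, so the paths may need adjusting for your setup:

#!/usr/bin/env python3
# Sketch: print the device allow-list (devices.list) for every cgroup belonging
# to a given Slurm job. Assumes cgroup v1 and a layout like
# /sys/fs/cgroup/devices/slurm/uid_<uid>/job_<jobid>/... -- adjust as needed.
import glob
import sys

def dump_device_allowlists(jobid):
    pattern = "/sys/fs/cgroup/devices/slurm/uid_*/job_%s/**/devices.list" % jobid
    for path in sorted(glob.glob(pattern, recursive=True)):
        print("==", path)
        with open(path) as f:
            print(f.read().rstrip())

if __name__ == "__main__":
    dump_device_allowlists(sys.argv[1])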


I also looked at multi-tenant jobs on our MARCC cluster that have been running for more than a day, and they are still inside their cgroups, but again this is on CentOS 6 clusters.
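
For reference, the kind of spot check I mean is roughly the following sketch (again assuming cgroup v1 and Slurm's usual slurm/uid_<uid>/job_<jobid> hierarchy; pass the job ID on the command line and adjust paths for your nodes):

#!/usr/bin/env python3
# Sketch: for every PID listed in a job's devices cgroups, confirm that
# /proc/<pid>/cgroup still shows the devices hierarchy under the job's path,
# i.e. the task has not been moved out of ("escaped") the job's cgroup.
# Assumes cgroup v1 and the usual /sys/fs/cgroup/devices/slurm/... layout.
import glob
import sys

def check_job(jobid):
    pattern = "/sys/fs/cgroup/devices/slurm/uid_*/job_%s/**/cgroup.procs" % jobid
    for procs in sorted(glob.glob(pattern, recursive=True)):
        with open(procs) as f:
            pids = [p.strip() for p in f if p.strip()]
        for pid in pids:
            try:
                with open("/proc/%s/cgroup" % pid) as f:
                    lines = f.read().splitlines()
            except OSError:
                continue  # task may have exited between reads
            for line in lines:
                # /proc/<pid>/cgroup lines look like "4:devices:/slurm/uid_.../job_..."
                controllers = line.split(":")[1].split(",")
                if "devices" in controllers:
                    ok = ("job_%s" % jobid) in line
                    print(pid, "contained" if ok else "ESCAPED", line)

if __name__ == "__main__":
    check_job(sys.argv[1])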

Are you still seeing cgroup escapes now, specifically for jobs running longer than a day?

Thanks,
Kevin



From: slurm-users <slurm-users-boun...@lists.schedmd.com> on behalf of Shawn Bobbin <sabob...@umiacs.umd.edu>
Reply-To: Slurm User Community List <slurm-users@lists.schedmd.com>
Date: Monday, April 23, 2018 at 2:45 PM
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Jobs escaping cgroup device controls after some amount of time.

Hi,

I attached our cgroup.conf and gres.conf.

As for the cgroup_allowed_devices.conf file, I have this file stubbed but empty. In 17.02 Slurm started fine without this file (as far as I remember) and it being empty doesn’t appear to actually impact anything… device availability remains the same. Based on the behavior explained in [0] I don’t expect this file to impact specific GPU containment.

TaskPlugin = task/cgroup
ProctrackType = proctrack/cgroup
JobAcctGatherType = jobacct_gather/cgroup

[0] https://bugs.schedmd.com/show_bug.cgi?id=4122