Hello,

We decided to route all jobs requesting between 1 and 20 cores to our serial
queue. The nodes controlled by the serial queue are shared by multiple users.
We did this to reduce fragmentation across the cluster -- our default "batch"
queue provides exclusive access to compute nodes.

The downside of the serial queue appears to be that jobs from different
users can interact quite badly. To some extent this is an education issue --
for example, MATLAB users need to be told to add the "-singleCompThread" option
to their command line. On the other hand, I wonder whether our cgroups setup is
optimal for the serial queue. Our cgroup.conf contains...

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainDevices=yes
TaskAffinity=no

CgroupMountpoint=/sys/fs/cgroup

The relevant cgroup configuration in slurm.conf is...

ProctrackType=proctrack/cgroup
TaskPlugin=affinity,cgroup

Could someone please advise us on the required/recommended cgroup setup for the
above scenario? For example, should we set "TaskAffinity=yes"? I assume the
interaction between jobs (jobs sometimes stall) is due to context switching at
the kernel level. Apart from educating users, how can we minimise that
switching on the serial nodes?
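
In case it helps with diagnosis, here is a minimal sketch of how one could
check from inside a job whether the core constraint is being applied. It
assumes Linux and Python 3. The helper names (parse_cpu_list, allowed_cpus,
is_oversubscribed) are my own and not part of Slurm. It reads the
SLURM_CPUS_ON_NODE environment variable if it is set and compares it with the
CPUs the process is actually allowed to run on.

```python
import os


def parse_cpu_list(text):
    """Expand a kernel-style CPU list such as '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def allowed_cpus():
    """Return the set of CPUs this process may be scheduled on (Linux only)."""
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity: assume all CPUs.
    return set(range(os.cpu_count() or 1))


def is_oversubscribed(thread_count, cpus):
    """True if more threads are planned than there are CPUs to run them on."""
    return thread_count > len(cpus)


if __name__ == "__main__":
    cpus = allowed_cpus()
    print("CPUs visible to this process:", sorted(cpus))
    allocated = os.environ.get("SLURM_CPUS_ON_NODE")
    if allocated is not None and len(cpus) > int(allocated):
        # The process can run on more CPUs than the job was given, which
        # suggests the core constraint is not confining it.
        print("Warning: job can run on more CPUs than it was allocated")
```

If a job on a shared node reports every core on the machine rather than only
the cores it asked for, that would suggest the job is not being confined and
could explain the interference between users.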

Best regards,
David
