Hello,
I am getting started with SLURM and I am having a hard time
understanding how it allocates CPUs to users depending on the resources
they request. The problem I am facing can be summarized as follows.
Consider a shell script test.sh that requests 8 CPUs but actually starts
a job that uses 10 CPUs:
#!/bin/sh
#SBATCH --ntasks=8
stress -c 10
On a server with 32 CPUs, if I submit this script 5 times with sbatch
test.sh, 4 of the jobs start running right away and the last one is left
pending, as shown by the squeue command:
JOBID PARTITION    NAME USER ST  TIME NODES NODELIST(REASON)
    5      main test.sh jack PD  0:00     1 (Resources)
    1      main test.sh jack  R  0:08     1 server
    2      main test.sh jack  R  0:08     1 server
    3      main test.sh jack  R  0:05     1 server
    4      main test.sh jack  R  0:05     1 server
The problem is that these 4 jobs are actually using 40 CPUs and overload
the server. Instead, I would expect SLURM either to refuse to start jobs
that use more resources than they requested, or to hold them until
enough resources are free to run them.
How can I make sure that the users of my server do not start jobs that
use too many CPUs?
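To make the arithmetic behind this explicit, here is a minimal sketch of
how I understand the situation. The numbers come from the setup described
above; the variable names are only illustrative and are not SLURM settings.

```shell
#!/bin/sh
# Illustrative arithmetic only; the variable names are made up for this post.
total_cpus=32         # CPUs=32 on the node "server"
requested_per_job=8   # from "#SBATCH --ntasks=8"
used_per_job=10       # from "stress -c 10"

# Number of jobs that fit at once when only the request is counted
running_jobs=$((total_cpus / requested_per_job))

# Number of CPU-bound workers those jobs actually start
actual_load=$((running_jobs * used_per_job))

echo "jobs running at once: $running_jobs"
echo "CPU-bound workers started: $actual_load (node has $total_cpus CPUs)"
```

This matches what I observe: 4 jobs are admitted based on their requests,
but together they start 40 workers on a 32-CPU node.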
Some useful details about my slurm.conf file:
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
# COMPUTE NODES
NodeName=server CPUs=32 RealMemory=10000 State=UNKNOWN
# PARTITIONS
PartitionName=main Nodes=server Default=YES Shared=YES MaxTime=INFINITE State=UP
I am probably making a trivial mistake in the configuration file, or
just misunderstanding a basic concept of SLURM. Any help or advice would
be much appreciated.
Many thanks in advance!