We have a heterogeneous mix of nodes, most 32-core but one group 36-core, grouped into homogeneous partitions. We'd like to be able to specify multiple partitions so that a job can run on any homogeneous group, and it would be nice if we could run on all such nodes using 32 cores per node. To try to do this, I created an additional partition for the 36-core nodes (call them n2019) which specifies a max CPU count of 64:

    PartitionName=n2019    DefMemPerCPU=2631 Nodes=compute-4-[0-47]
    PartitionName=n2019_32 DefMemPerCPU=2631 Nodes=compute-4-[0-47] MaxCPUsPerNode=64
    PartitionName=n2021    DefMemPerCPU=2960 Nodes=compute-7-[0-18]
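Concretely, the goal is to be able to submit something like the following (exact memory/time options aside) and have it land on whichever homogeneous group is free, using 32 cores per node either way:

    sbatch --partition=n2019_32,n2021 --ntasks=128 --ntasks-per-core=1 --exclusive job.pbs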
However, if I try to run a 128-task, 1-task-per-core job on n2019_32, the sbatch fails:

    sbatch --ntasks=128 --exclusive --partition=n2019_32 --ntasks-per-core=1 job.pbs
    sbatch: error: Batch job submission failed: Requested node configuration is not available

(Please ignore the ".pbs" - it's a relic, and the job script works with Slurm.) The identical command with "n2019" or "n2021" as the partition works, although the former uses 36 cores per node. If I specify multiple partitions, the job only actually runs when nodes outside of n2019 (which covers the same node set as n2019_32) are available. The job header includes only the walltime, job name, stdout/stderr files, shell, and a job array range. I tried adding "-v" to the sbatch command to see if that gives more useful info, but I couldn't get any more insight.

Does anyone have any idea why it's rejecting my job?

thanks,
Noam
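P.S. In case it helps, the job script header is essentially just the following; the names, time, and array range here are placeholders, not the real values:

    #!/bin/bash
    #SBATCH --time=24:00:00            # walltime (placeholder)
    #SBATCH --job-name=myjob           # job name (placeholder)
    #SBATCH --output=myjob.%A_%a.out   # stdout file (placeholder)
    #SBATCH --error=myjob.%A_%a.err    # stderr file (placeholder)
    #SBATCH --array=1-10               # job array range (placeholder)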