Hello,

We upgraded our cluster to Slurm 23.11.1 and then, a few weeks later, to 
23.11.4. Since then, Slurm no longer uses the hyperthreaded CPUs. We 
downgraded our test cluster: the issue is not present with Slurm 22.05 (we had 
skipped Slurm 23.02).

For example, we are working with this node:

$ slurmd -C
NodeName=node03 CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10 
ThreadsPerCore=2 RealMemory=128215

It is defined like this in slurm.conf:

SelectTypeParameters=CR_CPU_Memory
TaskPlugin=task/cgroup,task/affinity
NodeName=node03 CPUs=40 RealMemory=150000 Feature=htc MemSpecLimit=5000
NodeSet=htc Feature=htc
PartitionName=htc Default=YES MinNodes=0 MaxNodes=1 Nodes=htc DefMemPerCPU=1000 
State=UP LLN=Yes MaxMemPerNode=142000
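
For completeness, here is how we compare this with the controller's view of 
the node (just a standard scontrol query; the grep only trims the output):

$ scontrol show node node03 | grep -E 'CoresPerSocket|ThreadsPerCore|CPUTot'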

So no oversubscription: 20 cores and 40 CPUs thanks to hyperthreading. Until 
the upgrade, Slurm was allocating all 40 CPUs: when launching 40 jobs of 1 CPU 
each, every job would run on its own CPU. This is the expected behavior.
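
For reference, the test itself is nothing fancy, roughly the following 
(stress.py being our single-CPU busy-loop script):

$ for i in $(seq 40); do sbatch -n 1 -c 1 --wrap="./stress.py"; done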

Since the upgrade, we can still launch those 40 jobs, but only the first half 
of the CPUs are used (CPUs 0 to 19 according to htop). Each of those CPUs is 
shared by 2 jobs, while the second half of the CPUs (#20 to 39) stays 
completely idle. When launching 40 stress processes directly on the node, 
without going through Slurm, all the CPUs are used.
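
For the direct test we simply run the standard stress(1) tool on the node, 
something like:

$ stress --cpu 40 --timeout 120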

When binding to a specific CPU with srun, it works up to CPU #19; beyond that 
an error occurs, even though the allocation includes all the CPUs of the node:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=40
# Works for 0 to 19
srun --cpu-bind=v,map_cpu:19 stress.py

# Doesn't work (20 to 39)
srun --cpu-bind=v,map_cpu:20 stress.py
# Output:
srun: error: CPU binding outside of job step allocation, allocated CPUs are: 
0x00000FFFFF.
srun: error: Task launch for StepId=57194.0 failed on node node03: Unable to 
satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted
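
In case it is useful, the CPUs actually granted can also be checked from 
inside the same allocation, for example:

# from within the sbatch allocation above
srun -n 1 grep Cpus_allowed_list /proc/self/status
# or, with the node-level detail of the job:
scontrol -d show job $SLURM_JOB_ID | grep CPU_IDs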

This behaviour affects all our nodes, some of which have been restarted 
recently and others not. It causes the jobs to be frequently interrupted, 
widening the gap between real (wall-clock) time and user+system CPU time and 
making the jobs slower. We have been poring over the documentation but, from 
what we understand, our configuration seems correct. In particular, as advised 
by the documentation [1], we do not set ThreadsPerCore in slurm.conf.
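
For what it's worth, the gap shows up directly in the accounting data, for 
example (jobid being any of the 40 one-CPU jobs):

$ sacct -j <jobid> -o JobID,Elapsed,TotalCPU,UserCPU,SystemCPU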

Are we missing something, or is there a regression or a required configuration 
change since version 23.11?

Thank you,
Guillaume

[1] https://slurm.schedmd.com/slurm.conf.html#OPT_ThreadsPerCore
