Today we upgraded the controller node from Slurm 19.05 to 20.02.3, and immediately all Slurm commands run on the controller node started printing error messages for all partitions:

# sinfo --version
sinfo: error: NodeNames=a[001-140] CPUs=1 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
(lines deleted)
slurm 20.02.3

In slurm.conf we have defined NodeName like:

NodeName=a[001-140] Weight=10001 Boards=1 SocketsPerBoard=2 CoresPerSocket=4 ThreadsPerCore=1 ...

According to the slurm.conf manual, CPUs should then be calculated automatically:

"If CPUs is omitted, its default will be set equal to the product of Boards, Sockets, CoresPerSocket, and ThreadsPerCore."

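As a sanity check on the arithmetic, here is a minimal Python sketch of the documented default applied to the NodeName line above. The function name `default_cpus` is hypothetical, and the sketch assumes that Boards times SocketsPerBoard stands in for Sockets in the product. It only illustrates the manual's wording and is not Slurm's actual implementation.

```python
# Hypothetical helper (not part of Slurm): computes the default CPUs value
# as described in the slurm.conf manual, assuming SocketsPerBoard is given
# instead of Sockets.
def default_cpus(boards, sockets_per_board, cores_per_socket, threads_per_core):
    # Total sockets on the node, under the assumption stated above.
    sockets = boards * sockets_per_board
    return sockets * cores_per_socket * threads_per_core


# Values taken from the NodeName=a[001-140] definition in this post.
cpus = default_cpus(boards=1, sockets_per_board=2,
                    cores_per_socket=4, threads_per_core=1)
print(cpus)  # prints 8
```

So for these nodes I would expect CPUs to default to 8, not to the CPUs=1 that the error message reports.
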
Has anyone else seen this error with Slurm 20.02?

I wonder if there is a problem with specifying SocketsPerBoard instead of Sockets? The slurm.conf manual doesn't seem to prefer one over the other.

I've opened a bug report: https://bugs.schedmd.com/show_bug.cgi?id=9241

Thanks,
Ole

