Hi Ahmet,
On 6/16/20 11:27 AM, mercan wrote:
Did you check the /var/log/messages file for errors? Systemd logs to this
file instead of the slurmctld log file.
Ahmet M.
The syslog reports the same errors from slurmctld as are being reported by
every Slurm 20.02 command.
I have found a workaround: In the NodeName lines of slurm.conf, replace
"Boards=1 SocketsPerBoard=2" with "Sockets=2" and reconfigure the
daemons. For some reason 20.02 doesn't handle "Boards" configurations
correctly.
Any site with "Boards" in slurm.conf should reconfigure to "Sockets"
before installing/upgrading to 20.02.
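For sites with many NodeName lines, the replacement could be scripted. The following is only a rough Python sketch, not a tested migration tool. It assumes each NodeName definition sits on a single physical line of slurm.conf with "Boards=1" directly followed by "SocketsPerBoard=N"; the function name boards_to_sockets is made up for illustration. Lines with other Boards values are left untouched and would need manual review.

```python
import re

# Matches "Boards=1 SocketsPerBoard=<n>"; matching is case-insensitive to be lenient.
_BOARDS_RE = re.compile(r"\bBoards=1\s+SocketsPerBoard=(\d+)", re.IGNORECASE)


def boards_to_sockets(line: str) -> str:
    """Rewrite 'Boards=1 SocketsPerBoard=N' to 'Sockets=N' on NodeName lines.

    Hypothetical helper: any line that does not start with NodeName= is
    returned unchanged, as is any NodeName line without the Boards=1 pattern.
    """
    if not line.lstrip().lower().startswith("nodename="):
        return line
    return _BOARDS_RE.sub(r"Sockets=\1", line)


if __name__ == "__main__":
    example = ("NodeName=a[001-140] Weight=10001 Boards=1 SocketsPerBoard=2 "
               "CoresPerSocket=4 ThreadsPerCore=1")
    print(boards_to_sockets(example))
```

Run it over a copy of slurm.conf first and diff the result before distributing the file and reconfiguring the daemons.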
It may be a good idea to track updates to bug
https://bugs.schedmd.com/show_bug.cgi?id=9241
Best regards,
Ole
On 16.06.2020 at 11:12, Ole Holm Nielsen wrote:
Today we upgraded the controller node from 19.05 to 20.02.3, and
immediately all Slurm commands (on the controller node) give error
messages for all partitions:
# sinfo --version
sinfo: error: NodeNames=a[001-140] CPUs=1 match no Sockets,
Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore.
Resetting CPUs.
(lines deleted)
slurm 20.02.3
In slurm.conf we have defined NodeName like:
NodeName=a[001-140] Weight=10001 Boards=1 SocketsPerBoard=2
CoresPerSocket=4 ThreadsPerCore=1 ...
According to the slurm.conf manual the CPUs should then be calculated
automatically:
"If CPUs is omitted, its default will be set equal to the product of
Boards, Sockets, CoresPerSocket, and ThreadsPerCore."
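As a quick check of what that default should come out to for the NodeName line above, here is the arithmetic as a small Python sketch. It assumes SocketsPerBoard takes the place of Sockets in the product quoted from the manual; the helper name default_cpus is made up for illustration and is not part of Slurm.

```python
from math import prod


def default_cpus(boards: int, sockets_per_board: int,
                 cores_per_socket: int, threads_per_core: int) -> int:
    """Product of the topology values, per the slurm.conf manual sentence.

    Assumption: SocketsPerBoard stands in for Sockets in that product.
    """
    return prod([boards, sockets_per_board, cores_per_socket, threads_per_core])


if __name__ == "__main__":
    # Values from the NodeName=a[001-140] line: 1 * 2 * 4 * 1
    print(default_cpus(1, 2, 4, 1))  # 8
```

So the expected default would be CPUs=8, not the CPUs=1 shown in the error message.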
Has anyone else seen this error with Slurm 20.02?
I wonder if there is a problem with specifying SocketsPerBoard instead
of Sockets? The slurm.conf manual doesn't seem to prefer one over the
other.
I've opened a bug https://bugs.schedmd.com/show_bug.cgi?id=9241