We have a bunch of 64-core (quad-socket, 16 cores/socket) AMD servers and some
of them are reporting the following error from slurm, which I gather gets its
info from hwloc:
May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from hardware:
CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw) CoresPerSocket=16:8(hw)
ThreadsPerCore=1:1(hw)
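For reference, the counts slurmd is comparing should be reproducible straight
from hwloc with a small program like the sketch below (untested, written
against the hwloc 1.x API, where packages are still HWLOC_OBJ_SOCKET and NUMA
nodes are HWLOC_OBJ_NODE):

/* topo-check.c: print the counts that slurmd appears to compare.
 * Untested sketch assuming the hwloc 1.x API.
 * Build: gcc topo-check.c -lhwloc -o topo-check */
#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    int sockets, cores, pus, numanodes;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    sockets   = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_SOCKET);
    cores     = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
    pus       = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_PU);
    numanodes = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NODE);

    printf("NUMANodes=%d Sockets=%d CoresPerSocket=%d ThreadsPerCore=%d\n",
           numanodes, sockets,
           sockets > 0 ? cores / sockets : 0,
           cores > 0 ? pus / cores : 0);

    hwloc_topology_destroy(topology);
    return 0;
}
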
All nodes have the exact same CPUs, motherboards, and OS (PXE booted from the
same master image, even). The BIOS settings also look identical across nodes.
The nodes differ only in the amount of memory and the number of DIMMs.
There are two sets of nodes with different output from lstopo:

Group 1 (correct): reports 4 sockets with 16 cores per socket
Group 2 (incorrect): reports 8 sockets with 8 cores per socket

Group 2 seems to be (incorrectly?) treating NUMA nodes as sockets. The lstopo
output differs slightly between the two groups; note the extra Socket layer in
Group 2:
Group 1:

Machine (128GB)
  NUMANode L#0 (P#0 32GB) + Socket L#0
    # 16 cores listed
  <snip>
  NUMANode L#1 (P#2 32GB) + Socket L#1
    # 16 cores listed
  etc
<snip>

Group 2:

Machine (256GB)
  Socket L#0 (64GB)
    NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
      # 8 cores listed
    <snip>
    NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
      # 8 cores listed
    <snip>
  Socket L#1 (64GB)
    NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
      # 8 cores listed
  etc
<snip>

The Group 2 reporting doesn't match our hardware, at least as far as sockets
and cores per socket go. Is there a reason, other than the memory
configuration, that could cause this?
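If it would help with debugging, I can dump the raw detected topology from one
node in each group so they can be diffed. A minimal sketch for that, assuming
the hwloc 1.x export call that takes just a path and an hwloc built with XML
support (I can also just attach full lstopo output instead):

/* dump-topo.c: export hwloc's detected topology to XML so a Group 1
 * node and a Group 2 node can be compared.  Untested sketch assuming
 * the hwloc 1.x API and an hwloc built with XML support.
 * Build: gcc dump-topo.c -lhwloc -o dump-topo */
#include <stdio.h>
#include <hwloc.h>

int main(int argc, char *argv[])
{
    /* output path is just an example; pass e.g. n001.xml on each node */
    const char *path = (argc > 1) ? argv[1] : "topology.xml";
    hwloc_topology_t topology;

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* hwloc 1.x signature: (topology, path); hwloc 2.x adds a flags argument */
    hwloc_topology_export_xml(topology, path);
    printf("wrote %s\n", path);

    hwloc_topology_destroy(topology);
    return 0;
}
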
Thanks,
Craig