We have a bunch of 64-core (quad-socket, 16 cores/socket) AMD servers and some 
of them are reporting the following error from slurm, which I gather gets its 
info from hwloc: 
May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from hardware: 
CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw) CoresPerSocket=16:8(hw) 
ThreadsPerCore=1:1(hw)
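In that message the first number in each pair is the slurm.conf value and the second, tagged (hw), is what slurmd/hwloc detected, so the node definition in our slurm.conf is along these lines (reconstructed here from the numbers before the colons, so the exact line may differ):

  NodeName=n001 CPUs=64 Boards=1 SocketsPerBoard=4 CoresPerSocket=16 ThreadsPerCore=1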
All nodes have the exact same CPUs, motherboards and OS (PXE booted from the 
same master image even).  The BIOS settings between nodes also look the same.  
The nodes only differ in the amount of memory and number of DIMMs.
There are two sets of nodes with different output from lstopo:

Group 1 (correct): reports 4 sockets with 16 cores per socket.
Group 2 (incorrect): reports 8 sockets with 8 cores per socket.

Group 2 seems to be (incorrectly?) taking NUMA nodes as sockets.  The output of 
lstopo is slightly different in the two groups; note the extra Socket layer for 
group 2:
Group 1:

Machine (128GB)
  NUMANode L#0 (P#0 32GB) + Socket L#0
    # 16 cores listed
    <snip>
  NUMANode L#1 (P#2 32GB) + Socket L#1
    # 16 cores listed
  etc.
<snip>

Group 2:

Machine (256GB)
  Socket L#0 (64GB)
    NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
      # 8 cores listed
      <snip>
    NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
      # 8 cores listed
      <snip>
  Socket L#1 (64GB)
    NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
      # 8 cores listed
    etc.
<snip>
The group 2 reporting doesn't match our hardware, at least as far as 
sockets and cores per socket go--is there a reason other than the memory 
configuration that could cause this? 
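If it helps narrow it down, the counts slurm gets should match what a trivial 
hwloc program reports.  A minimal sketch, assuming the hwloc 1.x C API (where 
the socket object type is HWLOC_OBJ_SOCKET) and a made-up file name check_topo.c:

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;

    /* Build the same topology view that lstopo and slurmd work from. */
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Count sockets, NUMA nodes and cores as hwloc detects them. */
    printf("sockets=%d numanodes=%d cores=%d\n",
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET),
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NODE),
           hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));

    hwloc_topology_destroy(topo);
    return 0;
}

Compiled with gcc check_topo.c -lhwloc, I would expect sockets=4 on group 1 and 
sockets=8 on group 2, with cores=64 on both.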
Thanks,
Craig
