On 28/05/2014 14:13, Craig Kapfer wrote:
> Interesting, quite right, thank you very much.  Yes, these are AMD 6300
> series.  Same kernel, but these boxes seem to have different BIOS
> versions, straight from the factory, delivered in the same physical
> enclosure even!  Some are AMI 3.5 and some are 3.0.
>
> So Slurm is then incorrectly parsing correct output from lstopo when
> it generates this message?
>>
>>     May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from
>>     hardware: CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw)
>>     CoresPerSocket=16:8(hw) ThreadsPerCore=1:1(hw)
>>

It's saying "there are 8 sockets with 8 cores each in hardware, instead
of the 4 sockets with 16 cores each in the config", right?
My feeling is that Slurm just has a (valid) config that describes a
group 1 layout while it was running on a group 2 node in this case.
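
For reference, a sketch (node names hypothetical) of the slurm.conf node
definitions that would match each group's report, using the same
parameter names as the log message above:

    # Matches what group 1 machines report: 4 sockets x 16 cores each
    NodeName=n001 CPUs=64 Boards=1 SocketsPerBoard=4 CoresPerSocket=16 ThreadsPerCore=1
    # Matches what group 2 machines report: 8 "sockets" x 8 cores each
    NodeName=n002 CPUs=64 Boards=1 SocketsPerBoard=8 CoresPerSocket=8 ThreadsPerCore=1

A single definition shared by both groups will trigger the "Node
configuration differs from hardware" warning on whichever group it
doesn't match.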

Brice


> Thanks much,
>
> Craig
>
>
> On Wednesday, May 28, 2014 1:39 PM, Brice Goglin
> <brice.gog...@inria.fr> wrote:
>
>
> Aside from the BIOS config, are you sure that you have the exact same
> BIOS *version* on each node? (You can check in /sys/class/dmi/id/bios_*.)
> Same Linux kernel too?
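>
> To compare them across many nodes, here is a little C sketch (nothing
> hwloc-specific; bios_vendor, bios_version and bios_date are the
> standard sysfs DMI attribute files) that prints the fields so they can
> be diffed between nodes:
>
>     #include <stdio.h>
>
>     int main(void)
>     {
>         const char *files[] = { "bios_vendor", "bios_version", "bios_date" };
>         char path[64], line[128];
>         unsigned i;
>         for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
>             snprintf(path, sizeof(path), "/sys/class/dmi/id/%s", files[i]);
>             FILE *f = fopen(path, "r");
>             if (!f)
>                 continue; /* attribute not exported by this kernel */
>             if (fgets(line, sizeof(line), f))
>                 printf("%s: %s", files[i], line); /* line keeps its '\n' */
>             fclose(f);
>         }
>         return 0;
>     }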
>
> Also, recently we've seen somebody fix such problems by unplugging and
> replugging some CPUs on the motherboard. Seems crazy but it happened
> for real...
>
> By the way, your discussion of groups 1 and 2 below is wrong. Group 2
> doesn't say that NUMA node == socket, and it doesn't report 8 sockets
> of 8 cores each. It reports 4 sockets, each containing 2 NUMA nodes of
> 8 cores, and that's likely what you actually have here (AMD Opteron
> 6300 or 6200 processors?).
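>
> If you want to double-check what hwloc itself sees (which is
> presumably what slurmd parses), here is a minimal sketch against the
> hwloc C API, assuming the hwloc 1.x object names (link with -lhwloc):
>
>     #include <stdio.h>
>     #include <hwloc.h>
>
>     int main(void)
>     {
>         hwloc_topology_t topo;
>         hwloc_topology_init(&topo);
>         hwloc_topology_load(topo);
>         /* A group 2 machine should print 4 sockets, 8 NUMA nodes and
>          * 64 cores, with the socket depth smaller (closer to the
>          * root) than the NUMA node depth, i.e. sockets containing
>          * NUMA nodes rather than the other way around. */
>         printf("sockets: %d\n",
>                hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_SOCKET));
>         printf("NUMA nodes: %d\n",
>                hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NODE));
>         printf("cores: %d\n",
>                hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
>         printf("socket depth: %d, NUMA node depth: %d\n",
>                hwloc_get_type_depth(topo, HWLOC_OBJ_SOCKET),
>                hwloc_get_type_depth(topo, HWLOC_OBJ_NODE));
>         hwloc_topology_destroy(topo);
>         return 0;
>     }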
>
> Brice
>
>
>
On 28/05/2014 12:27, Craig Kapfer wrote:
>> We have a bunch of 64-core (quad-socket, 16 cores/socket) AMD servers, and
>> some of them are reporting the following error from Slurm, which I gather
>> gets its info from hwloc:
>>
>>     May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from
>>     hardware: CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw)
>>     CoresPerSocket=16:8(hw) ThreadsPerCore=1:1(hw)
>>
>> All nodes have the exact same CPUs, motherboards and OS (PXE-booted from the
>> same master image, even).  The BIOS settings between nodes also look the
>> same.  The nodes differ only in the amount of memory and the number of
>> DIMMs.  There are two sets of nodes with different output from lstopo:
>>
>> Group 1 (correct): reporting 4 sockets with 16 cores per socket
>> Group 2 (incorrect): reporting 8 sockets with 8 cores per socket
>>
>> Group 2 seems to be (incorrectly?) treating NUMA nodes as sockets.
>>
>> The output of lstopo is slightly different in the two groups; note the extra
>> Socket layer in group 2:
>>
>> Group 1: 
>> Machine (128GB)
>>   NUMANode L#0 (P#0 32GB) + Socket L#0
>>   #16 cores listed
>>   <snip>
>>   NUMANode L#1 (P#2 32GB) + Socket L#1
>>   #16 cores listed
>>   etc
>> <snip>
>>
>> Group 2:
>> Machine (256GB)
>>   Socket L#0 (64GB)
>>     NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
>>     # 8 cores listed
>>     <snip>
>>     NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
>>     # 8 cores listed
>>     <snip>
>>   Socket L#1 (64GB)
>>     NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
>>     # 8 cores listed
>>     etc
>> <snip>
>>
>> The group 2 reporting doesn't match our hardware, at least as far as sockets
>> and cores per socket go.  Is there a reason other than the memory
>> configuration that could cause this?
>> Thanks,
>> Craig
>>