On 28/05/2014 14:13, Craig Kapfer wrote:
> Interesting, quite right, thank you very much. Yes, these are AMD 6300
> series. Same kernel, but these boxes seem to have different BIOS
> versions, direct from the factory, delivered in the same physical
> enclosure even! Some are AMI 3.5 and some are 3.0.
>
> So slurm is then incorrectly parsing correct output from lstopo to
> generate this message?
>
>> May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from
>> hardware: CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw)
>> CoresPerSocket=16:8(hw) ThreadsPerCore=1:1(hw)
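For reference, each field in that slurmd line is a configured:detected pair. A quick sketch of pulling the pairs apart (illustrative only; the regex and helper below are mine, not Slurm's actual parsing code):

```python
import re

def parse_mismatch(line):
    """Split 'Key=config:detected(hw)' fields from a slurmd
    'Node configuration differs from hardware' message into
    {key: (configured, detected)} pairs. Illustrative sketch only."""
    return {k: (int(cfg), int(hw))
            for k, cfg, hw in re.findall(r"(\w+)=(\d+):(\d+)\(hw\)", line)}

msg = ("Node configuration differs from hardware: CPUs=64:64(hw) "
       "Boards=1:1(hw) SocketsPerBoard=4:8(hw) CoresPerSocket=16:8(hw) "
       "ThreadsPerCore=1:1(hw)")
pairs = parse_mismatch(msg)

# Only the socket/core split disagrees; the total core count matches:
assert pairs["SocketsPerBoard"] == (4, 8)
assert pairs["CoresPerSocket"] == (16, 8)
assert 4 * 16 == 8 * 8 == pairs["CPUs"][0]
```

Note that the configured and detected totals agree (4 x 16 = 8 x 8 = 64); only the socket/core split differs.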
It's saying "there are 8 sockets with 8 cores each in hw, instead of 4
sockets with 16 cores each in the config". My feeling is that Slurm just
has a (valid) config that matches group 1 while it was running on a
group-2 node in this case.

Brice


> Thanks much,
>
> Craig
>
>
> On Wednesday, May 28, 2014 1:39 PM, Brice Goglin
> <brice.gog...@inria.fr> wrote:
>
> Aside from the BIOS config, are you sure that you have the exact same
> BIOS *version* in each node? (You can check /sys/class/dmi/id/bios_*.)
> Same Linux kernel too?
>
> Also, we've recently seen somebody fix such problems by unplugging and
> replugging some CPUs on the motherboard. Seems crazy, but it happened
> for real...
>
> By the way, your discussion of groups 1 and 2 below is wrong. Group 2
> doesn't say that NUMA node == socket, and it doesn't report 8 sockets
> of 8 cores each. It reports 4 sockets, each containing 2 NUMA nodes of
> 8 cores each, and that's likely what you have here (AMD Opteron 6300
> or 6200 processors?).
>
> Brice
>
>
> On 28/05/2014 12:27, Craig Kapfer wrote:
>> We have a bunch of 64-core (quad-socket, 16 cores/socket) AMD servers
>> and some of them are reporting the following error from slurm, which
>> I gather gets its info from hwloc:
>>
>> May 27 11:53:04 n001 slurmd[3629]: Node configuration differs from
>> hardware: CPUs=64:64(hw) Boards=1:1(hw) SocketsPerBoard=4:8(hw)
>> CoresPerSocket=16:8(hw) ThreadsPerCore=1:1(hw)
>>
>> All nodes have exactly the same CPUs, motherboards and OS (PXE booted
>> from the same master image, even). The BIOS settings between nodes
>> also look the same. The nodes differ only in the amount of memory and
>> the number of DIMMs. There are two sets of nodes with different
>> output from lstopo:
>>
>> Group 1 (correct): reports 4 sockets with 16 cores per socket
>> Group 2 (incorrect): reports 8 sockets with 8 cores per socket
>>
>> Group 2 seems to be (incorrectly?) taking NUMA nodes as sockets.
>>
>> The output of lstopo is slightly different in the two groups; note
>> the extra Socket layer for group 2:
>>
>> Group 1:
>> Machine (128GB)
>>   NUMANode L#0 (P#0 32GB) + Socket L#0
>>     # 16 cores listed
>>     <snip>
>>   NUMANode L#1 (P#2 32GB) + Socket L#1
>>     # 16 cores listed
>>   etc.
>>   <snip>
>>
>> Group 2:
>> Machine (256GB)
>>   Socket L#0 (64GB)
>>     NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
>>       # 8 cores listed
>>       <snip>
>>     NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
>>       # 8 cores listed
>>       <snip>
>>   Socket L#1 (64GB)
>>     NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
>>   etc.
>>   <snip>
>>
>> The group 2 reporting doesn't match our hardware, at least as far as
>> sockets and cores per socket go -- is there a reason other than the
>> memory configuration that could cause this?
>>
>> Thanks,
>> Craig
>>
>> _______________________________________________
>> hwloc-users mailing list
>> hwloc-us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
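Brice's correction of the group-2 reading can be checked mechanically by counting object lines in the lstopo text output. A sketch -- the topology strings below are extrapolated from the quoted excerpts to all four sockets of the quad-socket machine, with the per-core lines omitted, and the `count()` helper is mine, not part of hwloc:

```python
# lstopo text output, extrapolated from the quoted excerpts to the full
# quad-socket machine (per-core lines omitted for brevity).
group1 = """\
Machine (128GB)
  NUMANode L#0 (P#0 32GB) + Socket L#0
  NUMANode L#1 (P#2 32GB) + Socket L#1
  NUMANode L#2 (P#4 32GB) + Socket L#2
  NUMANode L#3 (P#6 32GB) + Socket L#3"""

group2 = """\
Machine (256GB)
  Socket L#0 (64GB)
    NUMANode L#0 (P#0 32GB) + L3 L#0 (6144KB)
    NUMANode L#1 (P#1 32GB) + L3 L#1 (6144KB)
  Socket L#1 (64GB)
    NUMANode L#2 (P#2 32GB) + L3 L#2 (6144KB)
    NUMANode L#3 (P#3 32GB) + L3 L#3 (6144KB)
  Socket L#2 (64GB)
    NUMANode L#4 (P#4 32GB) + L3 L#4 (6144KB)
    NUMANode L#5 (P#5 32GB) + L3 L#5 (6144KB)
  Socket L#3 (64GB)
    NUMANode L#6 (P#6 32GB) + L3 L#6 (6144KB)
    NUMANode L#7 (P#7 32GB) + L3 L#7 (6144KB)"""

def count(topo, obj):
    """Count lines of lstopo text output mentioning one object type."""
    return sum(f"{obj} L#" in line for line in topo.splitlines())

# Both groups expose the same 4 physical sockets; group 2 merely shows
# the 2 NUMA nodes inside each socket as a separate topology level.
assert count(group1, "Socket") == count(group2, "Socket") == 4
assert count(group1, "NUMANode") == 4
assert count(group2, "NUMANode") == 8
```

So the group-2 topology is not "8 sockets of 8 cores each"; if Slurm reports 8 sockets there, it is presumably counting the NUMA-node level.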
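The BIOS-version check Brice suggests can be scripted. A minimal sketch, assuming only that the standard sysfs DMI files exist under /sys/class/dmi/id (the `read_dmi` helper and its `root` parameter are mine, added so the sketch can be exercised against a test directory):

```python
from pathlib import Path

def read_dmi(root="/sys/class/dmi/id"):
    """Collect the bios_* identification files under the given sysfs
    directory into a {filename: value} dict."""
    info = {}
    base = Path(root)
    if not base.is_dir():
        return info
    for p in sorted(base.glob("bios_*")):
        try:
            info[p.name] = p.read_text().strip()
        except OSError:
            pass  # a few DMI attributes are readable by root only
    return info
```

Running this on every node (e.g. under pdsh/clush) and diffing the results should make the AMI 3.0 vs 3.5 split visible.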