Hello

Good to know, thanks.

There are two ways to work around the issue:
* Run "lstopo foo.xml" on a node that doesn't have the bug, then export
  HWLOC_XMLFILE=foo.xml and HWLOC_THISSYSTEM=1 on the buggy nodes (that's
  what you call a "map" below). This works even with very old hwloc releases.
* export HWLOC_COMPONENTS=x86 (only works with hwloc >= 1.11.2)

Brice


On 07/01/2016 16:20, David Winslow wrote:
> Brice,
>
> Thanks for the information! It’s good to know it wasn’t a flaw in the
> upgrade. This bug must have been introduced in kernel 3.x. I ran lstopo on one
> of our servers that still has CentOS 6.5 and it correctly reports an L3 cache
> for every 6 cores, as shown below.
>
> We have 75 servers with the exact same specifications. I had only upgraded
> two when I came across this problem during testing. Since I have a correct
> map on the non-upgraded servers, can I use that map on the upgraded servers
> somehow? Essentially hard code it?
>
> ----------------------- FROM CentOS 6.5 -----------------------
> Socket L#0 (P#0 total=134215604KB CPUModel="AMD Opteron(tm) Processor 6344 " CPUType=x86_64)
>   NUMANode L#0 (P#0 local=67106740KB total=67106740KB)
>     L3Cache L#0 (size=6144KB linesize=64 ways=64)
>       L2Cache L#0 (size=2048KB linesize=64 ways=16)
>         L1iCache L#0 (size=64KB linesize=64 ways=2)
>           L1dCache L#0 (size=16KB linesize=64 ways=4)
>             Core L#0 (P#0)
>               PU L#0 (P#0)
>           L1dCache L#1 (size=16KB linesize=64 ways=4)
>             Core L#1 (P#1)
>               PU L#1 (P#1)
>       L2Cache L#1 (size=2048KB linesize=64 ways=16)
>         L1iCache L#1 (size=64KB linesize=64 ways=2)
>           L1dCache L#2 (size=16KB linesize=64 ways=4)
>             Core L#2 (P#2)
>               PU L#2 (P#2)
>           L1dCache L#3 (size=16KB linesize=64 ways=4)
>             Core L#3 (P#3)
>               PU L#3 (P#3)
>       L2Cache L#2 (size=2048KB linesize=64 ways=16)
>         L1iCache L#2 (size=64KB linesize=64 ways=2)
>           L1dCache L#4 (size=16KB linesize=64 ways=4)
>             Core L#4 (P#4)
>               PU L#4 (P#4)
>           L1dCache L#5 (size=16KB linesize=64 ways=4)
>             Core L#5 (P#5)
>               PU L#5 (P#5)
>
>> On Jan 7, 2016, at 1:22 AM, Brice Goglin
<brice.gog...@inria.fr> wrote:
>>
>> Hello
>>
>> This is a kernel bug for 12-core AMD Bulldozer/Piledriver (62xx/63xx)
>> processors. hwloc is just complaining about buggy L3 information. lstopo
>> should report one L3 above each set of 6 cores below each NUMA node. Instead
>> you get strange L3s with 2, 4 or 6 cores.
>>
>> If you're not binding tasks based on L3 locality and if your applications do
>> not care about L3, you can pass HWLOC_HIDE_ERRORS=1 in the environment to
>> hide the message.
>>
>> AMD was working on a kernel patch but it doesn't seem to be in upstream
>> Linux yet. In hwloc v1.11.2, you can work around the problem by passing
>> HWLOC_COMPONENTS=x86 in the environment.
>>
>> I am not sure why CentOS 6.5 didn't complain. That 2.6.32 kernel should be
>> buggy too, and old hwloc releases already complained about such bugs.
>>
>> thanks
>> Brice
>>
>>
>> On 07/01/2016 04:10, David Winslow wrote:
>>> I upgraded our servers from CentOS 6.5 to CentOS 7.2. Since then, when I run
>>> mpirun I get the following error, but the software continues to run and it
>>> appears to work fine.
>>>
>>> * hwloc 1.11.0rc3-git has encountered what looks like an error from the operating system.
>>> *
>>> * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
>>> * Error occurred in topology.c line 983
>>> *
>>> * The following FAQ entry in the hwloc documentation may help:
>>> * What should I do when hwloc reports "operating system" warnings?
>>> * Otherwise please report this error message to the hwloc user's mailing list,
>>> * along with the output+tarball generated by the hwloc-gather-topology script.
>>>
>>> I can replicate the error by simply running hwloc-info.
>>>
>>> The version of hwloc used with mpirun is 1.9. The version installed on the
>>> server that I ran is 1.7, which comes with CentOS 7. They both give the error
>>> with minor differences shown below.
>>>
>>> With hwloc 1.7
>>> * object (L3 cpuset 0x000003f0) intersection without inclusion!
>>> * Error occurred in topology.c line 753
>>>
>>> With hwloc 1.9
>>> * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
>>> * Error occurred in topology.c line 983
>>>
>>> The current kernel is 3.10.0-327.el7.x86_64. I’ve tried updating the kernel
>>> to a minor release update and even tried to install kernel v4.4.3. None of
>>> the kernels worked. Again, hwloc works fine in CentOS 6.5 with kernel
>>> 2.6.32-431.29.2.el6.x86_64.
>>>
>>> I’ve attached the files generated by hwloc-gather-topology.sh. I compared
>>> what this script says is the expected output to the actual output and, from
>>> what I can tell, they look the same. Maybe I’m missing something after
>>> staring all day at the information.
>>>
>>> I did a clean install of the OS to perform the upgrade from 6.5.
>>>
>>> I’ve attached the results of the hwloc-gather-topology.sh script. Any help
>>> will be greatly appreciated.
>>>
>>> _______________________________________________
>>> hwloc-users mailing list
>>> hwloc-us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>>> Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2016/01/1238.php
>> _______________________________________________
>> hwloc-users mailing list
>> hwloc-us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>> Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2016/01/1240.php
> _______________________________________________
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2016/01/1243.php
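
As a recap of the two workarounds suggested at the top of this thread, here is a minimal shell sketch. The function names and the XML path are hypothetical and only for illustration; the environment variables and the "lstopo foo.xml" command are the ones named in the thread. It assumes lstopo is installed on the node that exports the topology.

```shell
#!/bin/sh
# Sketch of the two workarounds from this thread.
# Function names and file paths are hypothetical, not part of hwloc.

# Workaround 1, step A: on a node WITHOUT the bug, save its topology as XML.
# Returns non-zero with a message when lstopo is not installed.
export_good_topology() {
    xml_path="$1"
    if command -v lstopo >/dev/null 2>&1; then
        lstopo "$xml_path"
    else
        echo "lstopo not found; cannot export topology" >&2
        return 1
    fi
}

# Workaround 1, step B: on a buggy node, make hwloc load that XML file
# instead of discovering the topology from the kernel.
use_exported_topology() {
    export HWLOC_XMLFILE="$1"
    # Tell hwloc that the XML describes this very machine.
    export HWLOC_THISSYSTEM=1
}

# Workaround 2 (per the thread, hwloc >= 1.11.2 only):
# select the x86 discovery component.
use_x86_component() {
    export HWLOC_COMPONENTS=x86
}
```

On a buggy node one would call use_exported_topology with the path of the copied XML file (or use_x86_component) in the shell that launches the job. Whether those variables reach processes started on other nodes depends on the launcher, which this thread does not cover.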