Thanks for the help. I successfully created the XML on a good machine and used it on the buggy machine. Both lstopo and hwloc-info now report correctly, and I no longer get the error when running MPI.
David

> On Jan 7, 2016, at 10:29 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>
> Hello
>
> Good to know, thanks.
>
> There are two ways to work around the issue:
> * Run "lstopo foo.xml" on a node that doesn't have the bug, then do export
>   HWLOC_XMLFILE=foo.xml and HWLOC_THISSYSTEM=1 on the buggy nodes (that's
>   what you call a "map" below). This works with very old hwloc releases.
> * export HWLOC_COMPONENTS=x86 (only works with hwloc >= 1.11.2)
>
> Brice
>
>
> Le 07/01/2016 16:20, David Winslow a écrit :
>> Brice,
>>
>> Thanks for the information! It's good to know it wasn't a flaw in the
>> upgrade. This bug must have been introduced in kernel 3.x. I ran lstopo on
>> one of our servers that still has CentOS 6.5, and it correctly reports an
>> L3 cache for every 6 cores, as shown below.
>>
>> We have 75 servers with the exact same specifications. I have only
>> upgraded two, which is when I came across this problem during testing.
>> Since I have a correct map on the non-upgraded servers, can I use that map
>> on the upgraded servers somehow? Essentially hard-code it?
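The XML workaround described above can be sketched as a short shell sequence. The file path is a placeholder, and the lstopo call is guarded so the sketch does not fail on machines without hwloc installed:

```shell
# On a node that reports the topology correctly (e.g. a CentOS 6.5 box),
# export the discovered topology to an XML file. Path is a placeholder.
if command -v lstopo >/dev/null 2>&1; then
  lstopo /tmp/good-topology.xml
fi

# Copy the file to each buggy node, then tell hwloc to load the XML instead
# of querying the (buggy) kernel, and to still treat the imported topology
# as describing the local machine:
export HWLOC_XMLFILE=/tmp/good-topology.xml
export HWLOC_THISSYSTEM=1

# Any hwloc-based tool launched from this environment (hwloc-info, lstopo,
# mpirun, ...) now sees the imported topology instead of the buggy one.
```

This only makes sense when all nodes really have identical hardware, as is the case for the 75 matching servers described above.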
>>
>> ----------------------- FROM CentOS 6.5 -----------------------
>> Socket L#0 (P#0 total=134215604KB CPUModel="AMD Opteron(tm) Processor 6344" CPUType=x86_64)
>>   NUMANode L#0 (P#0 local=67106740KB total=67106740KB)
>>     L3Cache L#0 (size=6144KB linesize=64 ways=64)
>>       L2Cache L#0 (size=2048KB linesize=64 ways=16)
>>         L1iCache L#0 (size=64KB linesize=64 ways=2)
>>           L1dCache L#0 (size=16KB linesize=64 ways=4)
>>             Core L#0 (P#0)
>>               PU L#0 (P#0)
>>           L1dCache L#1 (size=16KB linesize=64 ways=4)
>>             Core L#1 (P#1)
>>               PU L#1 (P#1)
>>       L2Cache L#1 (size=2048KB linesize=64 ways=16)
>>         L1iCache L#1 (size=64KB linesize=64 ways=2)
>>           L1dCache L#2 (size=16KB linesize=64 ways=4)
>>             Core L#2 (P#2)
>>               PU L#2 (P#2)
>>           L1dCache L#3 (size=16KB linesize=64 ways=4)
>>             Core L#3 (P#3)
>>               PU L#3 (P#3)
>>       L2Cache L#2 (size=2048KB linesize=64 ways=16)
>>         L1iCache L#2 (size=64KB linesize=64 ways=2)
>>           L1dCache L#4 (size=16KB linesize=64 ways=4)
>>             Core L#4 (P#4)
>>               PU L#4 (P#4)
>>           L1dCache L#5 (size=16KB linesize=64 ways=4)
>>             Core L#5 (P#5)
>>               PU L#5 (P#5)
>>
>>> On Jan 7, 2016, at 1:22 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>>
>>> Hello
>>>
>>> This is a kernel bug for 12-core AMD Bulldozer/Piledriver (62xx/63xx)
>>> processors. hwloc is just complaining about buggy L3 information. lstopo
>>> should report one L3 above each set of 6 cores below each NUMA node.
>>> Instead you get strange L3s with 2, 4 or 6 cores.
>>>
>>> If you're not binding tasks based on L3 locality, and if your
>>> applications do not care about L3, you can pass HWLOC_HIDE_ERRORS=1 in
>>> the environment to hide the message.
>>>
>>> AMD was working on a kernel patch, but it doesn't seem to be in upstream
>>> Linux yet. In hwloc v1.11.2, you can work around the problem by passing
>>> HWLOC_COMPONENTS=x86 in the environment.
>>>
>>> I am not sure why CentOS 6.5 didn't complain. That 2.6.32 kernel should
>>> be buggy too, and old hwloc releases already complained about such bugs.
>>>
>>> thanks
>>> Brice
>>>
>>>
>>> Le 07/01/2016 04:10, David Winslow a écrit :
>>>> I upgraded our servers from CentOS 6.5 to CentOS 7.2. Since then, when
>>>> I run mpirun I get the following error, but the software continues to
>>>> run and appears to work fine.
>>>>
>>>> * hwloc 1.11.0rc3-git has encountered what looks like an error from the operating system.
>>>> *
>>>> * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
>>>> * Error occurred in topology.c line 983
>>>> *
>>>> * The following FAQ entry in the hwloc documentation may help:
>>>> *   What should I do when hwloc reports "operating system" warnings?
>>>> * Otherwise please report this error message to the hwloc user's mailing list,
>>>> * along with the output+tarball generated by the hwloc-gather-topology script.
>>>>
>>>> I can replicate the error by simply running hwloc-info.
>>>>
>>>> The version of hwloc used with mpirun is 1.9. The version installed on
>>>> the server, which I also ran, is 1.7, the one that comes with CentOS 7.
>>>> They both give the error, with minor differences shown below.
>>>>
>>>> With hwloc 1.7:
>>>> * object (L3 cpuset 0x000003f0) intersection without inclusion!
>>>> * Error occurred in topology.c line 753
>>>>
>>>> With hwloc 1.9:
>>>> * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
>>>> * Error occurred in topology.c line 983
>>>>
>>>> The current kernel is 3.10.0-327.el7.x86_64. I've tried updating the
>>>> kernel to a minor release update, and I even tried installing kernel
>>>> v4.4.3. None of the kernels worked. Again, hwloc works fine on CentOS
>>>> 6.5 with kernel 2.6.32-431.29.2.el6.x86_64.
>>>>
>>>> I've attached the files generated by hwloc-gather-topology.sh. I
>>>> compared what the script says is the expected output to the actual
>>>> output and, from what I can tell, they look the same.
>>>> Maybe I'm missing something after staring all day at the information.
>>>>
>>>> I did a clean install of the OS to perform the upgrade from 6.5.
>>>>
>>>> I've attached the results of the hwloc-gather-topology.sh script. Any
>>>> help will be greatly appreciated.
>>>>
>>>> _______________________________________________
>>>> hwloc-users mailing list
>>>> hwloc-us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
>>>> Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2016/01/1238.php
>>> Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2016/01/1240.php
>> Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2016/01/1243.php
> Link to this post: http://www.open-mpi.org/community/lists/hwloc-users/2016/01/1244.php
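For reference, the two lighter-weight workarounds mentioned earlier in the thread can be sketched together in shell form. The mpirun invocation and application name are placeholders, guarded so the sketch runs even where MPI is not installed:

```shell
# If nothing binds tasks based on L3 locality, the hwloc warning is
# cosmetic and can simply be hidden (Brice's HWLOC_HIDE_ERRORS suggestion):
export HWLOC_HIDE_ERRORS=1

# Alternatively, with hwloc >= 1.11.2, bypass the buggy kernel-reported
# topology entirely by using the x86 CPUID-based discovery component:
# export HWLOC_COMPONENTS=x86

# Placeholder MPI launch; only attempted if mpirun and the app exist:
if command -v mpirun >/dev/null 2>&1 && [ -x ./my_app ]; then
  mpirun -np 4 ./my_app
fi
```

Note that hiding the error does not fix the broken L3 objects; the XML import described at the top of the thread is the option that actually restores a correct topology on the affected nodes.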