Thanks for the help. I successfully created the XML on a good machine and 
used it on the buggy machine. Both lstopo and hwloc-info now report the 
topology correctly, and I no longer get the error when running MPI.


David


> On Jan 7, 2016, at 10:29 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
> 
> Hello
> 
> Good to know, thanks.
> 
> There are two ways to work around the issue (quick sketch below):
> * run "lstopo foo.xml" on a node that doesn't have the bug, then export
> HWLOC_XMLFILE=foo.xml and HWLOC_THISSYSTEM=1 on the buggy nodes (that's
> what you call a "map" below). This works with very old hwloc releases.
> * export HWLOC_COMPONENTS=x86 (only works with hwloc >= 1.11.2)
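> 
> A minimal sketch of both, assuming all nodes have identical hardware; the
> host name and file path below are placeholders:
> 
>     # on a node that reports the topology correctly
>     lstopo foo.xml
>     scp foo.xml buggy-node:/tmp/foo.xml
> 
>     # on each buggy node, before launching anything that uses hwloc
>     export HWLOC_XMLFILE=/tmp/foo.xml
>     export HWLOC_THISSYSTEM=1
> 
>     # or, with hwloc >= 1.11.2, skip the XML and force the x86 backend
>     export HWLOC_COMPONENTS=x86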
> 
> Brice
> 
> 
> 
> 
> On 07/01/2016 16:20, David Winslow wrote:
>> Brice,
>> 
>> Thanks for the information! It’s good to know it wasn’t a flaw in the 
>> upgrade. This bug must have been introduced in kernel 3.x. I ran lstopo on 
>> one of our servers that still has CentOS 6.5, and it correctly reports an 
>> L3 cache for every six cores, as shown below.
>> 
>> We have 75 servers with the exact same specifications. I had only upgraded 
>> two when I came across this problem during testing. Since I have a correct 
>> map on the non-upgraded servers, can I use that map on the upgraded servers 
>> somehow? Essentially, hard-code it?
>> 
>> ----------------------- FROM Centos 6.5 -----------------------
>>  Socket L#0 (P#0 total=134215604KB CPUModel="AMD Opteron(tm) Processor 6344" CPUType=x86_64)
>>    NUMANode L#0 (P#0 local=67106740KB total=67106740KB)
>>      L3Cache L#0 (size=6144KB linesize=64 ways=64)
>>        L2Cache L#0 (size=2048KB linesize=64 ways=16)
>>          L1iCache L#0 (size=64KB linesize=64 ways=2)
>>            L1dCache L#0 (size=16KB linesize=64 ways=4)
>>              Core L#0 (P#0)
>>                PU L#0 (P#0)
>>            L1dCache L#1 (size=16KB linesize=64 ways=4)
>>              Core L#1 (P#1)
>>                PU L#1 (P#1)
>>        L2Cache L#1 (size=2048KB linesize=64 ways=16)
>>          L1iCache L#1 (size=64KB linesize=64 ways=2)
>>            L1dCache L#2 (size=16KB linesize=64 ways=4)
>>              Core L#2 (P#2)
>>                PU L#2 (P#2)
>>            L1dCache L#3 (size=16KB linesize=64 ways=4)
>>              Core L#3 (P#3)
>>                PU L#3 (P#3)
>>        L2Cache L#2 (size=2048KB linesize=64 ways=16)
>>          L1iCache L#2 (size=64KB linesize=64 ways=2)
>>            L1dCache L#4 (size=16KB linesize=64 ways=4)
>>              Core L#4 (P#4)
>>                PU L#4 (P#4)
>>            L1dCache L#5 (size=16KB linesize=64 ways=4)
>>              Core L#5 (P#5)
>>                PU L#5 (P#5)
>> 
>>> On Jan 7, 2016, at 1:22 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>> 
>>> Hello
>>> 
>>> This is a kernel bug affecting 12-core AMD Bulldozer/Piledriver (62xx/63xx) 
>>> processors. hwloc is just complaining about the buggy L3 information. lstopo 
>>> should report one L3 above the set of 6 cores under each NUMA node. Instead 
>>> you get bogus L3s covering 2, 4 or 6 cores.
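>>> 
>>> As a quick check of what a given node reports, something like this (just an 
>>> illustration; --of console forces lstopo's plain-text output):
>>> 
>>>     lstopo --of console | grep -E 'NUMANode|L3'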
>>> 
>>> If you're not binding tasks based on L3 locality and if your applications 
>>> do not care about L3, you can pass HWLOC_HIDE_ERRORS=1 in the environment 
>>> to hide the message.
>>> 
>>> AMD was working on a kernel patch, but it doesn't seem to have made it into 
>>> the upstream Linux kernel yet. Starting with hwloc 1.11.2, you can work around 
>>> the problem by passing HWLOC_COMPONENTS=x86 in the environment.
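>>> 
>>> If you launch through Open MPI's mpirun (an assumption; its -x option exports 
>>> an environment variable to the launched processes, and the process count and 
>>> binary name below are placeholders), that would look like:
>>> 
>>>     # just hide the warning; the L3 information stays wrong
>>>     mpirun -x HWLOC_HIDE_ERRORS=1 -np 4 ./your_app
>>> 
>>>     # with hwloc >= 1.11.2, use the x86 CPUID backend instead of the
>>>     # buggy Linux sysfs information
>>>     mpirun -x HWLOC_COMPONENTS=x86 -np 4 ./your_app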
>>> 
>>> I am not sure why CentOS 6.5 didn't complain. That 2.6.32 kernel should be 
>>> buggy too, and old hwloc releases already complained about such bugs.
>>> 
>>> thanks
>>> Brice
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 07/01/2016 04:10, David Winslow wrote:
>>>> I upgraded our servers from CentOS 6.5 to CentOS 7.2. Since then, when I 
>>>> run mpirun I get the following error, but the software continues to run and 
>>>> appears to work fine.
>>>> 
>>>> * hwloc 1.11.0rc3-git has encountered what looks like an error from the operating system.
>>>> *
>>>> * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
>>>> * Error occurred in topology.c line 983
>>>> *
>>>> * The following FAQ entry in the hwloc documentation may help:
>>>> *   What should I do when hwloc reports "operating system" warnings?
>>>> * Otherwise please report this error message to the hwloc user's mailing list,
>>>> * along with the output+tarball generated by the hwloc-gather-topology script.
>>>> 
>>>> I can replicate the error by simply running hwloc-info.
>>>> 
>>>> The version of hwloc used by mpirun is 1.9. The version installed on the 
>>>> server, which is the one I ran directly, is the 1.7 that comes with CentOS 7. 
>>>> They both give the error, with the minor differences shown below.
>>>> 
>>>> With hwloc 1.7
>>>> * object (L3 cpuset 0x000003f0) intersection without inclusion!
>>>> * Error occurred in topology.c line 753
>>>> 
>>>> With hwloc 1.9
>>>> * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
>>>> * Error occurred in topology.c line 983
>>>> 
>>>> The current kernel is 3.10.0-327.el7.x86_64. I’ve tried updating the 
>>>> kernel to a newer minor release and even tried installing kernel v4.4.3. 
>>>> None of those kernels fixed it. Again, hwloc works fine on CentOS 6.5 with 
>>>> kernel 2.6.32-431.29.2.el6.x86_64.
>>>> 
>>>> I’ve attached the files generated by hwloc-gather-topology.sh. I compared 
>>>> what the script says is the expected output against the actual output and, 
>>>> from what I can tell, they look the same. Maybe I’m missing something after 
>>>> staring at the information all day.
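>>>> 
>>>> For anyone who wants to repeat the comparison, a rough sketch (assuming the 
>>>> script was run with prefix "foo" and that lstopo accepts --input on the 
>>>> extracted snapshot directory; the output options may need to match however 
>>>> the .output file was generated):
>>>> 
>>>>     tar xjf foo.tar.bz2
>>>>     lstopo --input ./foo --of console > foo.actual
>>>>     diff -u foo.output foo.actual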
>>>> 
>>>> I did a clean install of the OS to perform the upgrade from 6.5.
>>>> 
>>>> Any help will be greatly appreciated.