Hello

Good to know, thanks.

There are two ways to work around the issue (see the sketch after this
list):
* Run "lstopo foo.xml" on a node that doesn't have the bug, then export
HWLOC_XMLFILE=foo.xml and HWLOC_THISSYSTEM=1 on the buggy nodes (that's
what you call a "map" below). This works with very old hwloc releases.
* Export HWLOC_COMPONENTS=x86 (only works with hwloc >= 1.11.2).
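
A minimal sketch of both workarounds (the /shared/foo.xml path is my
assumption; any location readable on the buggy nodes works):

  # On a node with a correct topology (e.g. a CentOS 6.5 machine),
  # save it as XML (lstopo picks the output format from the extension):
  lstopo /shared/foo.xml

  # On each buggy node, load that XML and tell hwloc that it
  # describes the local machine:
  export HWLOC_XMLFILE=/shared/foo.xml
  export HWLOC_THISSYSTEM=1

  # Alternatively (hwloc >= 1.11.2): skip the buggy Linux topology
  # files and rediscover everything through CPUID:
  export HWLOC_COMPONENTS=x86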

Brice

On 07/01/2016 16:20, David Winslow wrote:
> Brice,
>
> Thanks for the information! It’s good to know it wasn’t a flaw in the
> upgrade. This bug must have been introduced in kernel 3.x. I ran lstopo on one
> of our servers that still runs CentOS 6.5, and it correctly reports one L3
> cache for every six cores, as shown below.
>
> We have 75 servers with exactly the same specifications. I had only upgraded
> two when I came across this problem during testing. Since I have a correct
> map on the non-upgraded servers, can I use that map on the upgraded servers
> somehow? Essentially hard-code it?
>
> ----------------------- FROM CentOS 6.5 -----------------------
>   Socket L#0 (P#0 total=134215604KB CPUModel="AMD Opteron(tm) Processor 6344" CPUType=x86_64)
>     NUMANode L#0 (P#0 local=67106740KB total=67106740KB)
>       L3Cache L#0 (size=6144KB linesize=64 ways=64)
>         L2Cache L#0 (size=2048KB linesize=64 ways=16)
>           L1iCache L#0 (size=64KB linesize=64 ways=2)
>             L1dCache L#0 (size=16KB linesize=64 ways=4)
>               Core L#0 (P#0)
>                 PU L#0 (P#0)
>             L1dCache L#1 (size=16KB linesize=64 ways=4)
>               Core L#1 (P#1)
>                 PU L#1 (P#1)
>         L2Cache L#1 (size=2048KB linesize=64 ways=16)
>           L1iCache L#1 (size=64KB linesize=64 ways=2)
>             L1dCache L#2 (size=16KB linesize=64 ways=4)
>               Core L#2 (P#2)
>                 PU L#2 (P#2)
>             L1dCache L#3 (size=16KB linesize=64 ways=4)
>               Core L#3 (P#3)
>                 PU L#3 (P#3)
>         L2Cache L#2 (size=2048KB linesize=64 ways=16)
>           L1iCache L#2 (size=64KB linesize=64 ways=2)
>             L1dCache L#4 (size=16KB linesize=64 ways=4)
>               Core L#4 (P#4)
>                 PU L#4 (P#4)
>             L1dCache L#5 (size=16KB linesize=64 ways=4)
>               Core L#5 (P#5)
>                 PU L#5 (P#5)
>
>> On Jan 7, 2016, at 1:22 AM, Brice Goglin <brice.gog...@inria.fr> wrote:
>>
>> Hello
>>
>> This is a kernel bug for 12-core AMD Bulldozer/Piledriver (62xx/63xx) 
>> processors. hwloc is just complaining about buggy L3 information. lstopo 
>> should report one L3 above each set of 6 cores below each NUMA node. Instead 
>> you get strange L3s with 2, 4 or 6 cores.
>>
>> If you're not binding tasks based on L3 locality and your applications do
>> not care about L3, you can set HWLOC_HIDE_ERRORS=1 in the environment to
>> hide the message, as in the example below.
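>>
>> For example, with Open MPI's mpirun, "-x" forwards an environment
>> variable to the launched processes (a sketch; the process count and
>> application name are made up):
>>
>>   export HWLOC_HIDE_ERRORS=1                     # for mpirun's own hwloc
>>   mpirun -x HWLOC_HIDE_ERRORS=1 -np 12 ./my_app  # for the ranks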
>>
>> AMD was working on a kernel patch, but it doesn't seem to have reached the
>> upstream Linux kernel yet. With hwloc v1.11.2, you can work around the
>> problem by setting HWLOC_COMPONENTS=x86 in the environment; a quick check
>> is shown below.
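>>
>> A quick way to check the workaround (lstopo is hwloc's standard
>> display tool):
>>
>>   HWLOC_COMPONENTS=x86 lstopo
>>
>> and verify that one L3 now appears above each set of six cores.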
>>
>> I am not sure why CentOS 6.5 didn't complain. That 2.6.32 kernel should be 
>> buggy too, and old hwloc releases already complained about such bugs.
>>
>> thanks
>> Brice
>>
>> On 07/01/2016 04:10, David Winslow wrote:
>>> I upgraded our servers from CentOS 6.5 to CentOS 7.2. Since then, when I run
>>> mpirun I get the following error, but the software continues to run and
>>> appears to work fine.
>>>
>>> * hwloc 1.11.0rc3-git has encountered what looks like an error from the operating system.
>>> *
>>> * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
>>> * Error occurred in topology.c line 983
>>> *
>>> * The following FAQ entry in the hwloc documentation may help:
>>> *   What should I do when hwloc reports "operating system" warnings?
>>> * Otherwise please report this error message to the hwloc user's mailing list,
>>> * along with the output+tarball generated by the hwloc-gather-topology script.
>>>
>>> I can replicate the error by simply running hwloc-info.
>>>
>>> The version of hwloc used with mpirun is 1.9. The version installed on the
>>> server, which I also ran, is the 1.7 that comes with CentOS 7. Both give the
>>> error, with minor differences shown below.
>>>
>>> With hwloc 1.7:
>>> * object (L3 cpuset 0x000003f0) intersection without inclusion!
>>> * Error occurred in topology.c line 753
>>>
>>> With hwloc 1.9:
>>> * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
>>> * Error occurred in topology.c line 983
>>>
>>> The current kernel is 3.10.0-327.el7.x86_64. I’ve tried updating the kernel
>>> to a newer minor release and even tried installing kernel v4.4.3. None of
>>> them fixed the problem. Again, hwloc works fine on CentOS 6.5 with kernel
>>> 2.6.32-431.29.2.el6.x86_64.
>>>
>>> I’ve attached the files generated by hwloc-gather-topology.sh. I compared
>>> what the script says is the expected output to the actual output and, from
>>> what I can tell, they look the same. Maybe I’m missing something after
>>> staring at the information all day.
>>>
>>> I did a clean install of the OS to perform the upgrade from 6.5.
>>>
>>> Any help will be greatly appreciated.