We (Georgia Tech) have also been observing this on 16-core AMD Abu Dhabi machines 
(6378). We weren't aware of the HWLOC_COMPONENTS workaround, which seems to 
mitigate the issue.

Before:

# ./lstopo
****************************************************************************
* hwloc has encountered what looks like an error from the operating system.
*
* Socket (P#2 cpuset 0x0000ffff,0x0) intersects with NUMANode (P#3 cpuset 0x0000ff00,0xff000000) without inclusion!
* Error occurred in topology.c line 940
*
* Please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.
****************************************************************************
Machine (128GB total)
  Group0 L#0
    NUMANode L#0 (P#1 32GB)
...

After:

# export HWLOC_COMPONENTS=x86
# ./lstopo
Machine
  Socket L#0
    NUMANode L#0 (P#0) + L3 L#0 (6144KB)
      L2 L#0 (2048KB) + L1i L#0 (64KB)
...

These nodes are the only ones in our entire cluster that cause zombie processes 
using torque/moab. I have a feeling that the two issues are related. We use hwloc/1.10.0.

Not sure if this helps at all, but you are definitely not alone :)

Thanks,
-Mehmet



On Jun 29, 2017, at 1:24 AM, Brice Goglin <brice.gog...@inria.fr> wrote:

Hello

We've seen this issue many times (it's specific to 12-core Opterons), but I am 
surprised it still occurs with such a recent kernel. AMD was supposed to fix 
the kernel in early 2016, but I forgot to check whether something was actually 
pushed.

Anyway, you can likely ignore the issue, as documented in the FAQ 
https://www.open-mpi.org/projects/hwloc/doc/v1.11.7/a00305.php, unless you care 
about L3 affinity for binding. Otherwise, you can work around the issue by 
passing HWLOC_COMPONENTS=x86 in the environment so that hwloc uses cpuid instead 
of Linux sysfs files to discover the topology.
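
If exporting the variable for a whole session is undesirable, the workaround can also be scoped to one command. The following is only a sketch: run_with_x86 is a hypothetical helper name, not something hwloc provides, and it assumes lstopo is on the PATH.

```shell
# run_with_x86: hypothetical helper (not part of hwloc). Runs the given
# command with HWLOC_COMPONENTS=x86 set only in that command's environment.
run_with_x86() {
    env HWLOC_COMPONENTS=x86 "$@"
}

# Usage: run lstopo through the helper, if it is installed.
if command -v lstopo >/dev/null 2>&1; then
    run_with_x86 lstopo
fi
```

Because the variable is passed through env, the calling shell's own environment is left unchanged.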

Brice




On 29/06/2017 at 02:17, Julio Figueroa wrote:
Hi

I am experiencing the following issue when using pnetcdf version 1.8.1.
The machine is a Supermicro (H8DGi) dual-socket AMD Opteron 6238 
(patch_level=0x0600063d).
The BIOS is the latest from Supermicro (v3.5c 03/18/2016).
OS: Debian 9.0 Kernel: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1 (2017-06-18) x86_64 GNU/Linux
****************************************************************************
* hwloc 1.11.5 has encountered what looks like an error from the operating system.
*
* L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
* Error occurred in topology.c line 1074
*
* The following FAQ entry in the hwloc documentation may help:
*   What should I do when hwloc reports "operating system" warnings?
* Otherwise please report this error message to the hwloc user's mailing list,
* along with the output+tarball generated by the hwloc-gather-topology script.
****************************************************************************
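
To make the "intersects ... without inclusion" wording concrete: the two cpusets in this message are bitmasks, where 0x000003f0 covers CPUs 4-9 and 0x0000003f covers CPUs 0-5, so they share CPUs 4-5 but neither contains the other. A small sketch of that check follows; cpuset_relation is a hypothetical helper, and it only handles single-word masks, not comma-separated ones such as 0x0000ffff,0x0.

```shell
# cpuset_relation: hypothetical helper (not part of hwloc). Compares two
# single-word hexadecimal cpuset masks and prints how they relate.
cpuset_relation() {
    a=$(( $1 ))
    b=$(( $2 ))
    common=$(( a & b ))          # CPUs present in both sets
    if [ "$common" -eq 0 ]; then
        echo "disjoint"
    elif [ "$common" -eq "$a" ] || [ "$common" -eq "$b" ]; then
        echo "included"          # one set is contained in the other
    else
        echo "intersects without inclusion"
    fi
}

# Masks copied from the error message: L3 first, then NUMANode P#0.
cpuset_relation 0x000003f0 0x0000003f   # prints "intersects without inclusion"
```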

As suggested by the error message, the hwloc-gather-topology output is
attached.

Please let me know if you need more information.

Julio Figueroa
Oceanographer




_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users