We (Georgia Tech) have also been seeing this on 16-core AMD Abu Dhabi machines (Opteron 6378). We weren't aware of the HWLOC_COMPONENTS workaround, which seems to mitigate the issue.

Before:

    # ./lstopo
    ****************************************************************************
    * hwloc has encountered what looks like an error from the operating system.
    *
    * Socket (P#2 cpuset 0x0000ffff,0x0) intersects with NUMANode (P#3 cpuset 0x0000ff00,0xff000000) without inclusion!
    * Error occurred in topology.c line 940
    *
    * Please report this error message to the hwloc user's mailing list,
    * along with the output+tarball generated by the hwloc-gather-topology script.
    ****************************************************************************
    Machine (128GB total)
      Group0 L#0
        NUMANode L#0 (P#1 32GB)
        ...

After:

    # export HWLOC_COMPONENTS=x86
    # ./lstopo
    Machine
      Socket L#0
        NUMANode L#0 (P#0) + L3 L#0 (6144KB)
          L2 L#0 (2048KB) + L1i L#0 (64KB)
          ...

These nodes are the only ones in our entire cluster that cause zombie processes under torque/moab. I have a feeling that the two problems are related. We use hwloc/1.10.0.

Not sure if this helps at all, but you are definitely not alone :)

Thanks,
-Mehmet

On Jun 29, 2017, at 1:24 AM, Brice Goglin <brice.gog...@inria.fr> wrote:

Hello

We've seen this issue many times (it's specific to 12-core Opterons), but I am surprised it still occurs with such a recent kernel. AMD was supposed to fix the kernel in early 2016, but I forgot to check whether anything was actually pushed.

Anyway, you can likely ignore the issue, as documented in the FAQ
https://www.open-mpi.org/projects/hwloc/doc/v1.11.7/a00305.php
unless you care about L3 affinity for binding. Otherwise, you can work around the issue by passing HWLOC_COMPONENTS=x86 in the environment, so that hwloc uses cpuid instead of the Linux sysfs files to discover the topology.

Brice

On 29/06/2017 at 02:17, Julio Figueroa wrote:

Hi

I am experiencing the following issue when using pnetcdf version 1.8.1.

The machine is a Supermicro (H8DGi) dual-socket AMD Opteron 6238 (patch_level=0x0600063d).
The BIOS is the latest from Supermicro (v3.5c 03/18/2016).
OS: Debian 9.0
Kernel: 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u1 (2017-06-18) x86_64 GNU/Linux

    ****************************************************************************
    * hwloc 1.11.5 has encountered what looks like an error from the operating system.
    *
    * L3 (cpuset 0x000003f0) intersects with NUMANode (P#0 cpuset 0x0000003f) without inclusion!
    * Error occurred in topology.c line 1074
    *
    * The following FAQ entry in the hwloc documentation may help:
    *   What should I do when hwloc reports "operating system" warnings?
    * Otherwise please report this error message to the hwloc user's mailing list,
    * along with the output+tarball generated by the hwloc-gather-topology script.
    ****************************************************************************

As suggested by the error message, the hwloc-gather-topology output is attached.

Please let me know if you need more information.

Julio Figueroa
Oceanographer

_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/hwloc-users
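
To make the "intersects ... without inclusion" wording in the two warnings concrete, here is a minimal Python sketch that decodes the cpuset strings quoted in this thread and compares them. It does not use hwloc itself. The helper names (parse_cpuset, cpus, relation) are made up for illustration, and the parser assumes the cpuset strings are comma-separated 32-bit hexadecimal words with the most significant word first.

```python
def parse_cpuset(text):
    """Parse a cpuset string such as "0x0000ffff,0x0" into an integer mask.

    Assumption: each comma-separated chunk is a 32-bit hexadecimal word,
    most significant word first.
    """
    mask = 0
    for chunk in text.split(","):
        mask = (mask << 32) | int(chunk, 16)
    return mask


def cpus(mask):
    """Return the sorted list of CPU indexes whose bit is set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def relation(a, b):
    """Classify how two cpuset masks relate to each other."""
    common = a & b
    if common == 0:
        return "disjoint"
    if common == a or common == b:
        return "included"
    return "intersects without inclusion"


# Cpusets from the Opteron 6378 warning (Socket P#2 vs NUMANode P#3).
socket = parse_cpuset("0x0000ffff,0x0")
numa = parse_cpuset("0x0000ff00,0xff000000")
print(relation(socket, numa))   # intersects without inclusion
print(cpus(socket & numa))      # [40, 41, 42, 43, 44, 45, 46, 47]

# Cpusets from the Opteron 6238 warning (L3 vs NUMANode P#0).
l3 = parse_cpuset("0x000003f0")
numa0 = parse_cpuset("0x0000003f")
print(relation(l3, numa0))      # intersects without inclusion
print(cpus(l3 & numa0))         # [4, 5]
```

In both cases the two objects share some CPUs, but neither set contains the other, so they cannot be nested into a tree. That overlap is what the warning reports.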