This patch should fix the issue. We recently had to fix the same problem for CPU#0 being offline, but I didn't know the fix could also be needed for NUMA node#0 being offline.

I am trying to release hwloc 2.5 "soon". If that's too slow, please let me know and I'll see if I can do a 2.4.1 earlier.

Brice




commit 7c159d723432e461b4e48cc2d38212913d2ba7c7
Author: Brice Goglin <brice.gog...@inria.fr>
Date:   Mon Apr 26 20:35:42 2021 +0200

    linux: fix support for NUMA node0 being offline

    Just like we didn't support offline CPU#0 until commit
    7bcc273efd50536961ba16d474efca4ae163229b, we need to
    support node0 being offline as well.
    It's not clear whether it's a new Linux feature or not,
    this was reported on a POWER LPAR VM.

    We opportunistically assume node0 is online to avoid
    the overhead in the vast majority of cases. If node0
    is missing, we parse "online" to find the first node.

    Thanks to Jirka Hladky for the report.

    Signed-off-by: Brice Goglin <brice.gog...@inria.fr>

diff --git a/hwloc/topology-linux.c b/hwloc/topology-linux.c
index 94b242dd0..10e038e64 100644
--- a/hwloc/topology-linux.c
+++ b/hwloc/topology-linux.c
@@ -5264,6 +5264,9 @@ static const char *find_sysfs_cpu_path(int root_fd, int *old_filenames)
 static const char *find_sysfs_node_path(int root_fd)
 {
+  unsigned first;
+  int err;
+
   if (!hwloc_access("/sys/bus/node/devices", R_OK|X_OK, root_fd)
       && !hwloc_access("/sys/bus/node/devices/node0/cpumap", R_OK, root_fd))
     return "/sys/bus/node/devices";
@@ -5272,6 +5275,28 @@ static const char *find_sysfs_node_path(int root_fd)
       && !hwloc_access("/sys/devices/system/node/node0/cpumap", R_OK, root_fd))
     return "/sys/devices/system/node";
+  /* node0 might be offline, fallback to looking at the first online node.
+   * online contains comma-separated ranges, just read the first number.
+   */
+  hwloc_debug("Failed to find sysfs node files using node0, looking at online nodes...\n");
+  err = hwloc_read_path_as_uint("/sys/devices/system/node/online", &first, root_fd);
+  if (err) {
+    hwloc_debug("Failed to find read /sys/devices/system/node/online.\n");
+  } else {
+    char path[PATH_MAX];
+    hwloc_debug("Found node#%u as first online node\n", first);
+
+    snprintf(path, sizeof(path), "/sys/bus/node/devices/node%u/cpumap", first);
+    if (!hwloc_access("/sys/bus/node/devices", R_OK|X_OK, root_fd)
+        && !hwloc_access(path, R_OK, root_fd))
+      return "/sys/bus/node/devices";
+
+    snprintf(path, sizeof(path), "/sys/devices/system/node/node%u/cpumap", first);
+    if (!hwloc_access("/sys/devices/system/node", R_OK|X_OK, root_fd)
+        && !hwloc_access(path, R_OK, root_fd))
+      return "/sys/devices/system/node";
+  }
+
   return NULL;
 }
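
The fallback in the patch depends on reading only the first number from the sysfs "online" file, whose contents are comma-separated ranges (for example "2" on the LPAR reported in this thread, or something like "0-1,4-5" elsewhere). The following is a minimal standalone sketch of that parsing step, not hwloc code: the helper name parse_first_online_node is hypothetical, and it takes the file contents as a string instead of going through hwloc's root_fd-based helpers.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, NOT part of hwloc: extract the first node index from
 * the contents of a sysfs "online" file such as "2\n" or "0-1,4-5\n".
 * Returns 0 on success and stores the index in *first, -1 otherwise. */
static int parse_first_online_node(const char *contents, unsigned *first)
{
  char *end;
  unsigned long value;

  /* reject NULL, an empty file, or anything not starting with a digit */
  if (!contents || !isdigit((unsigned char) contents[0]))
    return -1;

  errno = 0;
  value = strtoul(contents, &end, 10);
  if (errno || value > UINT_MAX)
    return -1;

  /* the first number may end the list, or start a range or a new item */
  if (*end != '\0' && *end != '\n' && *end != '-' && *end != ',')
    return -1;

  *first = (unsigned) value;
  return 0;
}
```

With the returned index, a caller would build a path such as "/sys/devices/system/node/node%u/cpumap" and test it for readability, as the patch does with snprintf() and hwloc_access().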




On 26/04/2021 at 16:48, Brice Goglin wrote:

Hello,

Maybe we have something that assumes that the first NUMA node on Linux is #0. And something is wrong in the disallowed case anyway since the NUMA node physical number is 0 instead of 2 there.

Can you run "hwloc-gather-topology lpar" and send the resulting lpar.tar.bz2? (send it only to me if it's too big or somehow confidential).

Thanks

Brice



On 26/04/2021 at 16:40, Jirka Hladky wrote:
Hi Brice,

how are you doing? I hope you are fine. We are all well and safe.

I have been running hwloc on an IBM Power LPAR VM with only 1 CPU core and 8 PUs [1]. There is only one NUMA node. The numbering is quite strange, however: the NUMA node number is "2". See [2].

hwloc reports "Topology does not contain any NUMA node, aborting!"

$ lstopo
Topology does not contain any NUMA node, aborting!
hwloc_topology_load() failed (No such file or directory).

Could you please double-check if this behavior is correct? I believe hwloc should work on this HW setup.

FYI, we can get it working with the --disallowed option [3] (but I think it should work without this option as well).

Thanks a lot!
Jirka


[1] $ lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           1
NUMA node(s):        1

[2] There is ONE NUMA node with the number "2":
$ numactl -H
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 7614 MB
node 2 free: 1098 MB
node distances:
node   2
 2:  10

[3]
$ lstopo --disallowed
Machine (7615MB total)
 Package L#0
   NUMANode L#0 (P#0 7615MB)
   L3 L#0 (4096KB) + L2 L#0 (1024KB) + Core L#0
     L1d L#0 (32KB) + L1i L#0 (48KB)
       Die L#0 + PU L#0 (P#0)
       PU L#1 (P#2)
       PU L#2 (P#4)
       PU L#3 (P#6)
     L1d L#1 (32KB) + L1i L#1 (48KB)
       PU L#4 (P#1)
       PU L#5 (P#3)
       PU L#6 (P#5)
       PU L#7 (P#7)
 Block(Disk) "sda"
 Net "env2"




_______________________________________________
hwloc-devel mailing list
hwloc-devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-devel
