Hello

Do you know which lstopo output is correct here? Do you have a way to
check whether the IB interface is actually connected to the first NUMA
node of the 2nd package, or to the 2nd NUMA node of the 1st package?
Benchmarking IB bandwidth with memory/cores in NUMA node #1 vs #2 would
be nice.
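
In case it helps, here is a rough Python sketch of that comparison. It
reads the NUMA node that the kernel reports for each IB device from
sysfs, and builds numactl command lines that pin a benchmark to one
node. Several things here are assumptions on my side: that numactl is
installed, that the OS node numbers match the #1/#2 shown by lstopo, and
that something like ib_write_bw from perftest is your benchmark (replace
it with whatever you actually run). Also note that sysfs only shows what
the kernel believes, so it does not settle the question by itself.

```python
import glob
import os


def ib_numa_nodes(sysfs_root="/sys/class/infiniband"):
    """Map each InfiniBand device name to the NUMA node reported by the kernel."""
    nodes = {}
    for dev in sorted(glob.glob(os.path.join(sysfs_root, "*"))):
        path = os.path.join(dev, "device", "numa_node")
        try:
            with open(path) as f:
                nodes[os.path.basename(dev)] = int(f.read().strip())
        except (OSError, ValueError):
            continue  # no usable locality information for this device
    return nodes


def pinned_command(node, command):
    """Prefix a command so numactl binds its cores and memory to one NUMA node."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + list(command)


if __name__ == "__main__":
    print(ib_numa_nodes())
    # "ib_write_bw" is only a placeholder for the benchmark of your choice.
    for node in (1, 2):
        print(" ".join(pinned_command(node, ["ib_write_bw"])))
```

If the bandwidth is clearly better with one of the two nodes, that tells
us which lstopo output matches the hardware.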

The warning/fixup was added in hwloc 1.11 to work around
Haswell/Broadwell Xeon bugs in the Linux kernel. It was removed in hwloc
2.x because the kernel was fixed a while ago. It looks like the
warning/fixup detection was too broad and unfortunately also matches
Xeon 9200.

Judging from a diagram I found on the web, all PCI slots on this
platform go to the first package, none to the 2nd package. If so, then
lstopo 2.0 is correct and the 1.11.13 fixup should be disabled in this case.

So I guess you should just export the environment variables such as
HWLOC_PCI_0000_40_LOCALCPUS= (empty value), as suggested in the warning.
We're likely never going to release a 1.11.14, so don't expect a proper
fix for this. We wouldn't be able to test the code on the broken HSW/BDW
platforms anymore anyway.
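
If you launch things from a script, a minimal Python sketch of the
workaround could look like the following. The helper name
env_without_pci_fixup is made up for illustration, and 0000_40 is just
the bus from the example above; take the actual variable names from the
warning that hwloc 1.11.13 prints on your machine.

```python
import os
import shutil
import subprocess


def env_without_pci_fixup(bus_ids=("0000_40",), base_env=None):
    """Return a copy of the environment with HWLOC_PCI_<bus>_LOCALCPUS set to empty."""
    env = dict(os.environ if base_env is None else base_env)
    for bus in bus_ids:
        # Empty value, as the warning suggests, for each listed PCI bus.
        env[f"HWLOC_PCI_{bus}_LOCALCPUS"] = ""
    return env


if __name__ == "__main__":
    # Run lstopo with the modified environment, if it is installed.
    if shutil.which("lstopo"):
        subprocess.run(["lstopo"], env=env_without_pci_fixup(), check=False)
```

The same environment can be passed to whatever starts your MPI jobs, so
that hwloc 1.11.13 sees the variables there as well.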

If you can confirm this through IB benchmarking, it'd be very nice.

Thanks

Brice



On 29/08/2020 at 14:41, Christian Tuma wrote:
> Dear hwloc experts,
>
> Using hwloc 1.11.13 I receive an "incorrect PCI locality information"
> error message. The complete message is attached as file
> "lstopo_1.11.13.err".
>
> I get this error on a dual socket Xeon Platinum 9242 system running
> CentOS 7.8.
>
> I don't see this error on a dual socket Xeon Gold 6148 system running
> the same CentOS release (7.8).
>
> And if I remember correctly, I also did not see that error earlier
> with our dual socket Xeon Platinum 9242 system before it was updated
> to version 7.8 of CentOS.
>
> So to me it is the combination of that specific CentOS release (7.8)
> and that particular CPU type (Xeon Platinum 9242) which triggers the
> error in hwloc 1.11.13.
>
> With hwloc 2.1.0, however, I do not see any error message. For your
> reference, I am attaching the XML output files obtained from hwloc
> 1.11.13 and 2.1.0.
>
> Unfortunately, I cannot switch from hwloc 1.x to 2.x because I need to
> compile OpenMPI 3.x where hwloc 1.x is required. And simply setting
> HWLOC_HIDE_ERRORS is not a true solution.
>
> Could someone please provide a fix for this particular problem in
> hwloc 1.x?
>
_______________________________________________
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users
