Re: [hwloc-users] hwloc 1.11.13 incorrect PCI locality information Xeon Platinum 9242

2020-09-01 Thread Christian Tuma

Hello Brice,

Thank you for your reply ...

On 30.08.2020 at 14:24, Brice Goglin wrote:

...
Do you know which lstopo is correct here?


yes - version 2.x


Do you have a way to know if
the IB interface is indeed connected to first NUMA node of 2nd package,
or to 2nd NUMA node of 1st package?


bcn2226:~ $ opahfirev
##
bcn2226  - HFI :54:00.0
HFI:   hfi1_0
Board: ChipABI 3.0, ChipRev 7.17, SW Compat 3
SN:0x013b8b3b
Location:Discrete  Socket:0 PCISlot:00 NUMANode:1  HFI0
Bus:
GUID:  0011:7509:013b:8b3b
SiRev: B1 (11)
##

=> 2nd NUMA node of 1st package
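
For reference, here is a minimal sketch of how I read the opahfirev indices. It assumes that the NUMANode value is a zero-based OS index, that nodes are numbered contiguously within each package, and that each package has two NUMA nodes, which matches what this machine reports. The struct and function names are hypothetical and belong neither to hwloc nor to the OPA tools.

```c
#include <assert.h>

/* Hypothetical helper type: where an OS NUMA node sits in the machine. */
struct numa_location {
    int package;         /* zero-based package (socket) index */
    int node_in_package; /* zero-based NUMA node index inside that package */
};

/* Assumes OS NUMA node indices are contiguous within each package. */
static struct numa_location locate_numa_node(int os_node, int nodes_per_package)
{
    struct numa_location loc;

    assert(os_node >= 0 && nodes_per_package > 0);
    loc.package = os_node / nodes_per_package;
    loc.node_in_package = os_node % nodes_per_package;
    return loc;
}
```

Calling locate_numa_node(1, 2) for "NUMANode:1" gives package 0 and node 1 within that package, i.e. the 2nd NUMA node of the 1st package, consistent with "Socket:0" above.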


Benchmarking IB bandwidth when
memory/cores are in NUMA node #1 vs #2 would be nice.


I did that, and it clearly confirms that this PCI slot is indeed attached to 
the 2nd NUMA node of the 1st package. See the attached plot.



... and the 1.11.13 fixup should be disabled in this case.


OK, I will change line 234 in "pci-common.c" from

if (cpumodel && strstr(cpumodel, "Xeon")) {

to something like

if (cpumodel && strstr(cpumodel, "Xeon") && (strstr(cpumodel, "v3") || strstr(cpumodel, "v4"))) {


to restrict the fixup to Xeon Haswell (v3) and Xeon Broadwell (v4). This 
was the intention you had in mind, correct?
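
To make the effect of the narrower condition concrete, here is a small standalone sketch of the same test. The function name is hypothetical and this is not the actual hwloc code. The model strings mentioned in the note afterwards are only illustrative examples of what the CPU model string might look like.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper mirroring the proposed condition; not hwloc code.
 * Returns 1 when the model string names a Xeon and contains "v3" or "v4",
 * and 0 otherwise (including when no model string is available). */
static int xeon_fixup_applies(const char *cpumodel)
{
    if (cpumodel == NULL || strstr(cpumodel, "Xeon") == NULL)
        return 0;
    return strstr(cpumodel, "v3") != NULL || strstr(cpumodel, "v4") != NULL;
}
```

With this check, a string such as "Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz" would still trigger the fixup, while "Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz" would not. Note that a plain substring match on "v3" or "v4" is loose: it matches any Xeon model string containing those characters, so it may cover more models than just the intended Haswell and Broadwell parts.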



Best wishes,
Christian

--
Dr. Christian Tuma
Consultant, Supercomputing
Zuse Institute Berlin, Takustr. 7, 14195 Berlin, Germany
+49 30 84185132 | t...@zib.de | www.zib.de

[hwloc-users] hwloc 1.11.13 incorrect PCI locality information Xeon Platinum 9242

2020-08-29 Thread Christian Tuma

Dear hwloc experts,

Using hwloc 1.11.13 I receive an "incorrect PCI locality information" 
error message. The complete message is attached as file 
"lstopo_1.11.13.err".


I get this error on a dual socket Xeon Platinum 9242 system running 
CentOS 7.8.


I don't see this error on a dual socket Xeon Gold 6148 system running 
the same CentOS release (7.8).


And if I remember correctly, I also did not see that error earlier with 
our dual socket Xeon Platinum 9242 system before it was updated to 
version 7.8 of CentOS.


So to me it is the combination of that specific CentOS release (7.8) and 
that particular CPU type (Xeon Platinum 9242) which triggers the error 
in hwloc 1.11.13.


With hwloc 2.1.0, however, I do not see any error message. For your 
reference, I am attaching the XML output files obtained from hwloc 
1.11.13 and 2.1.0.


Unfortunately, I cannot switch from hwloc 1.x to 2.x because I need to 
compile Open MPI 3.x, which requires hwloc 1.x. And simply setting 
HWLOC_HIDE_ERRORS is not a real solution, since it only silences the message.


Could someone please provide a fix for this particular problem in hwloc 1.x?


Thank you in advance -
Christian Tuma


* hwloc 1.11.13 has encountered an incorrect PCI locality information.
* PCI bus :40 is supposedly close to 2nd NUMA node of 1st package,
* however hwloc believes this is impossible on this architecture.
* Therefore the PCI bus will be moved to 1st NUMA node of 2nd package.
*
* If you feel this fixup is wrong, disable it by setting in your environment
* HWLOC_PCI__40_LOCALCPUS= (empty value), and report the problem
* to the hwloc's user mailing list together with the XML output of lstopo.
*
* You may silence this message by setting HWLOC_HIDE_ERRORS=1 in your environment.


* hwloc 1.11.13 has encountered an incorrect PCI locality information.
* PCI bus :44 is supposedly close to 2nd NUMA node of 1st package,
* however hwloc believes this is impossible on this architecture.
* Therefore the PCI bus will be moved to 1st NUMA node of 2nd package.
*
* If you feel this fixup is wrong, disable it by setting in your environment
* HWLOC_PCI__44_LOCALCPUS= (empty value), and report the problem
* to the hwloc's user mailing list together with the XML output of lstopo.
*
* You may silence this message by setting HWLOC_HIDE_ERRORS=1 in your environment.


* hwloc 1.11.13 has encountered an incorrect PCI locality information.
* PCI bus :53 is supposedly close to 2nd NUMA node of 1st package,
* however hwloc believes this is impossible on this architecture.
* Therefore the PCI bus will be moved to 1st NUMA node of 2nd package.
*
* If you feel this fixup is wrong, disable it by setting in your environment
* HWLOC_PCI__53_LOCALCPUS= (empty value), and report the problem
* to the hwloc's user mailing list together with the XML output of lstopo.
*
* You may silence this message by setting HWLOC_HIDE_ERRORS=1 in your environment.


* hwloc 1.11.13 has encountered an incorrect PCI locality information.
* PCI bus :62 is supposedly close to 2nd NUMA node of 1st package,
* however hwloc believes this is impossible on this architecture.
* Therefore the PCI bus will be moved to 1st NUMA node of 2nd package.
*
* If you feel this fixup is wrong, disable it by setting in your environment
* HWLOC_PCI__62_LOCALCPUS= (empty value), and report the problem
* to the hwloc's user mailing list together with the XML output of lstopo.
*
* You may silence this message by setting HWLOC_HIDE_ERRORS=1 in your environment.


* hwloc 1.11.13 has encountered an incorrect PCI locality information.
* PCI bus :71 is supposedly close to 2nd NUMA node of 1st package,
* however hwloc believes this is impossible on this architecture.
* Therefore the PCI bus will be moved to 1st NUMA node of 2nd package.
*
* If you feel this fixup is wrong, disable it by setting in your environment
* HWLOC_PCI__71_LOCALCPUS= (empty value), and report the problem
* to the hwloc's user mailing list together with the XML output of lstopo.
*
* You may silence this message by setting HWLOC_HIDE_ERRORS=1