Brice,
as a side note, what is the rationale for defining the distance as a
floating-point number?
I remember I had to fix a bug in OMPI a while ago
/* e.g. replace if (d1 == d2) with if (fabs(d1 - d2) < epsilon) */
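For instance, a minimal sketch of such a comparison (helper name
hypothetical; the epsilon has to be chosen for the value range at hand):

#include <math.h>
#include <stdbool.h>

/* Compare two relative distances with a tolerance instead of exact
 * equality. The fabsf() matters: a signed difference would wrongly
 * report "equal" whenever d1 < d2. */
static bool distances_equal(float d1, float d2)
{
    const float epsilon = 1e-5f;  /* tolerance chosen for illustration */
    return fabsf(d1 - d2) < epsilon;
}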
Cheers,
Gilles
On 9/1/2015 5:28 AM, Brice Goglin wrote:
The locality of mlx4_0 as reported by lstopo is "near the entire
machine" (while mlx4_1 is reported near NUMA node #3). I would vote
for buggy PCI-NUMA affinity being reported by the BIOS. But I am not
very familiar with 4x E5-4600 machines, so please make sure this PCI
slot is really attached to a single NUMA node (some older 4-socket
machines have an I/O hub attached to two sockets).
Given the lspci output, mlx4_0 is likely on the PCI bus attached to
NUMA node #0, so you should be able to work around the issue by
setting HWLOC_PCI_0000_00_LOCALCPUS=0xfff in the environment.
There are 8 host bridges in this machine, 2 attached to each processor,
so there are likely similar issues for the others.
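If it helps, here is a minimal sketch (hwloc 1.x API) for checking what
locality each OS device ends up with. The override is read while PCI
objects are being discovered, so it must be in the environment before
hwloc_topology_load(); setenv() below has the same effect as exporting
the variable before mpirun.

#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t dev = NULL;
    char cpus[256];

    /* Must be set before the topology is loaded. */
    setenv("HWLOC_PCI_0000_00_LOCALCPUS", "0xfff", 1);

    hwloc_topology_init(&topo);
    /* I/O objects (PCI devices, OpenFabrics HCAs) are not kept by default. */
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
    hwloc_topology_load(topo);

    /* For each OS device (mlx4_0, eth0, ...), print the cpuset of its
     * first non-I/O ancestor: a NUMA node when locality is known, the
     * whole machine when it is not. */
    while ((dev = hwloc_get_next_osdev(topo, dev)) != NULL) {
        hwloc_obj_t anc = hwloc_get_non_io_ancestor_obj(topo, dev);
        hwloc_bitmap_snprintf(cpus, sizeof(cpus), anc->cpuset);
        printf("%s: %s cpuset=%s\n", dev->name,
               hwloc_obj_type_string(anc->type), cpus);
    }

    hwloc_topology_destroy(topo);
    return 0;
}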
Brice
Le 31/08/2015 22:06, Rolf vandeVaart a écrit :
There was a problem reported on the users list about Open MPI always
picking one Mellanox card when there were two in the machine.
http://www.open-mpi.org/community/lists/users/2015/08/27507.php
We dug a little deeper, and I think this has to do with how hwloc is
figuring out where one of the cards is located. This verbose output
(with some extra printfs) shows that it cannot figure out which NUMA
node mlx4_0 is closest to. It can only determine that it is located on
HWLOC_OBJ_SYSTEM, and therefore Open MPI assumes a distance of 0.0.
Because of this (smaller is better), the Open MPI library always picks
mlx4_0 for all sockets. I am trying to figure out whether this is a
hwloc or Open MPI bug. Any thoughts on this?
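For context, here is a rough sketch (hwloc 1.x API; not the actual Open
MPI code, with object names following the debug output below) of the
lookup being described: walk from the device's hwloc object up to a NUMA
node, then read the relative latency to the NUMA node the process is
bound to.

#include <hwloc.h>

/* my_obj is the NUMA node the process is bound to; ibv_obj is the hwloc
 * object reported for the ibv device. Returns 0.0 when the device's
 * locality is unknown, which is what makes mlx4_0 look "closest" here. */
static float device_distance(hwloc_topology_t topo,
                             hwloc_obj_t ibv_obj, hwloc_obj_t my_obj)
{
    const struct hwloc_distances_s *dist =
        hwloc_get_whole_distance_matrix_by_type(topo, HWLOC_OBJ_NODE);

    /* Walk up until we reach a NUMA node; if we reach the machine first
     * (the mlx4_0 case below), the NUMA locality is unknown. */
    while (ibv_obj != NULL && ibv_obj->type != HWLOC_OBJ_NODE) {
        if (ibv_obj->type == HWLOC_OBJ_MACHINE)
            return 0.0f;
        ibv_obj = ibv_obj->parent;
    }
    if (ibv_obj == NULL || dist == NULL)
        return 0.0f;

    /* latency is an nbobjs x nbobjs row-major matrix of relative values. */
    return dist->latency[my_obj->logical_index * dist->nbobjs
                         + ibv_obj->logical_index];
}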
[node1.local:05821] Checking distance for device=mlx4_1
[node1.local:05821] hwloc_distances->nbobjs=4
[node1.local:05821] hwloc_distances->latency[0]=1.000000
[node1.local:05821] hwloc_distances->latency[1]=2.100000
[node1.local:05821] hwloc_distances->latency[2]=2.100000
[node1.local:05821] hwloc_distances->latency[3]=2.100000
[node1.local:05821] hwloc_distances->latency[4]=2.100000
[node1.local:05821] hwloc_distances->latency[5]=1.000000
[node1.local:05821] hwloc_distances->latency[6]=2.100000
[node1.local:05821] hwloc_distances->latency[7]=2.100000
[node1.local:05821] ibv_obj->type = 4
[node1.local:05821] ibv_obj->logical_index=1
[node1.local:05821] my_obj->logical_index=0
[node1.local:05821] Proc is bound: distance=2.100000
[node1.local:05821] Checking distance for device=mlx4_0
[node1.local:05821] hwloc_distances->nbobjs=4
[node1.local:05821] hwloc_distances->latency[0]=1.000000
[node1.local:05821] hwloc_distances->latency[1]=2.100000
[node1.local:05821] hwloc_distances->latency[2]=2.100000
[node1.local:05821] hwloc_distances->latency[3]=2.100000
[node1.local:05821] hwloc_distances->latency[4]=2.100000
[node1.local:05821] hwloc_distances->latency[5]=1.000000
[node1.local:05821] hwloc_distances->latency[6]=2.100000
[node1.local:05821] hwloc_distances->latency[7]=2.100000
[node1.local:05821] ibv_obj->type = 1 <---------------------HWLOC_OBJ_MACHINE
[node1.local:05821] ibv_obj->type set to NULL
[node1.local:05821] Proc is bound: distance=0.000000
[node1.local:05821] [rank=0] openib: skipping device mlx4_1; it is too far away
[node1.local:05821] [rank=0] openib: using port mlx4_0:1
[node1.local:05821] [rank=0] openib: using port mlx4_0:2
Machine (1024GB)
NUMANode L#0 (P#0 256GB) + Socket L#0 + L3 L#0 (30MB)
L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
NUMANode L#1 (P#1 256GB)
Socket L#1 + L3 L#1 (30MB)
L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
HostBridge L#5
PCIBridge
PCI 15b3:1003
Net L#7 "ib2"
Net L#8 "ib3"
OpenFabrics L#9 "mlx4_1"
NUMANode L#2 (P#2 256GB) + Socket L#2 + L3 L#2 (30MB)
L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
L2 L#32 (256KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
L2 L#33 (256KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
L2 L#34 (256KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
L2 L#35 (256KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
NUMANode L#3 (P#3 256GB) + Socket L#3 + L3 L#3 (30MB)
L2 L#36 (256KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
L2 L#37 (256KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
L2 L#38 (256KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
L2 L#39 (256KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
L2 L#40 (256KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
L2 L#41 (256KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
L2 L#42 (256KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
L2 L#43 (256KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
L2 L#44 (256KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
L2 L#45 (256KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
L2 L#46 (256KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
L2 L#47 (256KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
HostBridge L#0
PCIBridge
PCI 8086:1528
Net L#0 "eth0"
PCI 8086:1528
Net L#1 "eth1"
PCIBridge
PCI 1000:005d
Block L#2 "sda"
PCIBridge
PCI 15b3:1003
Net L#3 "ib0"
Net L#4 "ib1"
OpenFabrics L#5 "mlx4_0"
PCIBridge
PCI 102b:0522
PCI 19a2:0800
PCI 8086:1d02
Block L#6 "sr0"