The locality of mlx4_0 as reported by lstopo is "near the entire
machine" (while mlx4_1 is reported near NUMA node #1). I would vote for
buggy PCI-NUMA affinity being reported by the BIOS. But I am not very
familiar with 4x E5-4600 machines so please make sure this PCI slot is
really attached to a single NUMA node (some older 4-socket machines have
some I/O hub attached to 2 sockets).

Given the lspci output, mlx4_0 is likely on the PCI bus attached to NUMA
node #0, so you should be able to work around the issue by setting
HWLOC_PCI_0000_00_LOCALCPUS=0xfff in the environment.
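
As a rough illustration of where 0xfff comes from: the lstopo output lists
PUs P#0 through P#11 under NUMA node #0, and the mask carries one bit per PU.
The sketch below is only an assumption-laden helper, not part of hwloc; the
function names cpuset_mask and pci_localcpus_var are made up for this example.

```python
def cpuset_mask(pu_indexes):
    """Return a hex bitmask string with one bit set per physical PU index."""
    mask = 0
    for pu in pu_indexes:
        mask |= 1 << pu
    return hex(mask)


def pci_localcpus_var(domain, bus):
    """Build a HWLOC_PCI_<domain>_<bus>_LOCALCPUS name (hypothetical helper)."""
    return "HWLOC_PCI_%04x_%02x_LOCALCPUS" % (domain, bus)


# NUMA node #0 holds PUs P#0..P#11 in the quoted lstopo output.
name = pci_localcpus_var(0x0000, 0x00)
value = cpuset_mask(range(12))
print("export %s=%s" % (name, value))
```

This only handles PUs below index 32. As far as I know, hwloc writes larger
masks as comma-separated 32-bit words, which this sketch does not attempt.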

There are 8 hostbridges in this machine, 2 attached to each processor,
so there are likely similar issues for the others.

Brice



On 31/08/2015 22:06, Rolf vandeVaart wrote:
>
> There was a problem reported on the User's list about Open MPI always
> picking one Mellanox card when there were two in the machine.
>
>
> http://www.open-mpi.org/community/lists/users/2015/08/27507.php
>
>
> We dug a little deeper and I think this has to do with how hwloc is
> figuring out where one of the cards is located.  This verbose output
> (with some extra printfs) shows that it cannot figure out which NUMA
> node mlx4_0 is closest to. It can only determine it is located on
> HWLOC_OBJ_SYSTEM and therefore Open MPI assumes a distance of 0.0.
> Because of this (smaller is better) the Open MPI library always picks
> mlx4_0 for all sockets.  I am trying to figure out if this is a hwloc
> or Open MPI bug. Any thoughts on this?
>
>
> [node1.local:05821] Checking distance for device=mlx4_1
> [node1.local:05821] hwloc_distances->nbobjs=4
> [node1.local:05821] hwloc_distances->latency[0]=1.000000
> [node1.local:05821] hwloc_distances->latency[1]=2.100000
> [node1.local:05821] hwloc_distances->latency[2]=2.100000
> [node1.local:05821] hwloc_distances->latency[3]=2.100000
> [node1.local:05821] hwloc_distances->latency[4]=2.100000
> [node1.local:05821] hwloc_distances->latency[5]=1.000000
> [node1.local:05821] hwloc_distances->latency[6]=2.100000
> [node1.local:05821] hwloc_distances->latency[7]=2.100000
> [node1.local:05821] ibv_obj->type = 4
> [node1.local:05821] ibv_obj->logical_index=1
> [node1.local:05821] my_obj->logical_index=0
> [node1.local:05821] Proc is bound: distance=2.100000
>
> [node1.local:05821] Checking distance for device=mlx4_0
> [node1.local:05821] hwloc_distances->nbobjs=4
> [node1.local:05821] hwloc_distances->latency[0]=1.000000
> [node1.local:05821] hwloc_distances->latency[1]=2.100000
> [node1.local:05821] hwloc_distances->latency[2]=2.100000
> [node1.local:05821] hwloc_distances->latency[3]=2.100000
> [node1.local:05821] hwloc_distances->latency[4]=2.100000
> [node1.local:05821] hwloc_distances->latency[5]=1.000000
> [node1.local:05821] hwloc_distances->latency[6]=2.100000
> [node1.local:05821] hwloc_distances->latency[7]=2.100000
> [node1.local:05821] ibv_obj->type = 1 <---------------------HWLOC_OBJ_MACHINE
> [node1.local:05821] ibv_obj->type set to NULL
> [node1.local:05821] Proc is bound: distance=0.000000
>
> [node1.local:05821] [rank=0] openib: skipping device mlx4_1; it is too far away
> [node1.local:05821] [rank=0] openib: using port mlx4_0:1
> [node1.local:05821] [rank=0] openib: using port mlx4_0:2
>
>
> Machine (1024GB)
>   NUMANode L#0 (P#0 256GB) + Socket L#0 + L3 L#0 (30MB)
>     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>     L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>     L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>     L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>     L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>   NUMANode L#1 (P#1 256GB)
>     Socket L#1 + L3 L#1 (30MB)
>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
>       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
>       L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
>       L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
>       L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
>       L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
>       L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
>       L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
>       L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
>       L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
>     HostBridge L#5
>       PCIBridge
>         PCI 15b3:1003
>           Net L#7 "ib2"
>           Net L#8 "ib3"
>           OpenFabrics L#9 "mlx4_1"
>
>   NUMANode L#2 (P#2 256GB) + Socket L#2 + L3 L#2 (30MB)
>     L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
>     L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
>     L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
>     L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
>     L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
>     L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
>     L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
>     L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
>     L2 L#32 (256KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
>     L2 L#33 (256KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
>     L2 L#34 (256KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
>     L2 L#35 (256KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
>   NUMANode L#3 (P#3 256GB) + Socket L#3 + L3 L#3 (30MB)
>     L2 L#36 (256KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
>     L2 L#37 (256KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
>     L2 L#38 (256KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
>     L2 L#39 (256KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
>     L2 L#40 (256KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
>     L2 L#41 (256KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
>     L2 L#42 (256KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
>     L2 L#43 (256KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
>     L2 L#44 (256KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
>     L2 L#45 (256KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
>     L2 L#46 (256KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
>     L2 L#47 (256KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
>   HostBridge L#0
>     PCIBridge
>       PCI 8086:1528
>         Net L#0 "eth0"
>       PCI 8086:1528
>         Net L#1 "eth1"
>     PCIBridge
>       PCI 1000:005d
>         Block L#2 "sda"
>     PCIBridge
>       PCI 15b3:1003
>         Net L#3 "ib0"
>         Net L#4 "ib1"
>         OpenFabrics L#5 "mlx4_0"
>     PCIBridge
>       PCI 102b:0522
>       PCI 19a2:0800
>     PCI 8086:1d02
>       Block L#6 "sr0"
>
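
The selection behaviour described in the quoted message can be reproduced in
a few lines. This is only a sketch of the logic as Rolf describes it, not the
actual Open MPI code: the function names are hypothetical, the latency matrix
is assumed to be stored row-major (nbobjs x nbobjs), and the entries beyond
index 7 are assumed to follow the same 1.0 / 2.1 pattern seen in the log.

```python
NBOBJS = 4
# Row-major latency matrix: 1.0 on the diagonal, 2.1 elsewhere
# (values past index 7 are an assumption, the log stops there).
LATENCY = [1.0 if i == j else 2.1 for i in range(NBOBJS) for j in range(NBOBJS)]


def device_distance(device_node, proc_node):
    """Distance between the bound process and a device.

    device_node is None when the device is only known to be attached to
    the whole machine; the described behaviour then falls back to 0.0.
    """
    if device_node is None:
        return 0.0
    return LATENCY[proc_node * NBOBJS + device_node]


def pick_devices(devices, proc_node):
    """Keep only the devices at the smallest distance (smaller is better)."""
    dist = {name: device_distance(node, proc_node) for name, node in devices.items()}
    best = min(dist.values())
    return sorted(name for name, d in dist.items() if d == best)


# Locality as reported in the lstopo output: mlx4_0 under the machine,
# mlx4_1 under NUMA node L#1.
devices = {"mlx4_0": None, "mlx4_1": 1}
for proc_node in range(NBOBJS):
    print(proc_node, pick_devices(devices, proc_node))
```

With the reported locality, mlx4_0 wins for every NUMA node, even for a
process bound to the node that hosts mlx4_1 (0.0 beats 1.0). If mlx4_0 were
instead reported on node 0, a process on node 1 would pick mlx4_1.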
