The locality is mlx4_0 as reported by lstopo is "near the entire machine" (while mlx4_1 is reported near NUMA node #3). I would vote for buggy PCI-NUMA affinity being reported by the BIOS. But I am not very familiar with 4x E5-4600 machines so please make sure this PCI slot is really attached to a single NUMA node (some older 4-socket machines have some I/O hub attached to 2 sockets).
Given the lspci output, mlx4_0 is likely on the PCI bus attached to NUMA node #0, so you should be able to work-around the issue by setting HWLOC_PCI_0000_00_LOCALCPUS=0xfff in the environment. There are 8 hostbridges in this machine, 2 attached to each processor, there are likely similar issues for others. Brice Le 31/08/2015 22:06, Rolf vandeVaart a écrit : > > There was a problem reported on the User's list about Open MPI always > picking one Mellanox card when they were two in the machine. > > > http://www.open-mpi.org/community/lists/users/2015/08/27507.php > > > We dug a little deeper and I think this has to do with how hwloc is > figuring out where one of the cards is located. This verbose output > (with some extra printfs) shows that it cannot figure out which NUMA > node mlx4_0 is closest too. It can only determine it is located on > HWLOC_OBJ_SYSTEM and therefore Open MPI assumes a distance of 0.0. > Because of this (smaller is better) Open MPI library always picks > mlx4_0 for all sockets. I am trying to figure out if this is a hwloc > or Open MPI bug. Any thoughts on this? > > > [node1.local:05821] Checking distance for device=mlx4_1 > [node1.local:05821] hwloc_distances->nbobjs=4 > [node1.local:05821] hwloc_distances->latency[0]=1.000000 > [node1.local:05821] hwloc_distances->latency[1]=2.100000 > [node1.local:05821] hwloc_distances->latency[2]=2.100000 > [node1.local:05821] hwloc_distances->latency[3]=2.100000 > [node1.local:05821] hwloc_distances->latency[4]=2.100000 > [node1.local:05821] hwloc_distances->latency[5]=1.000000 > [node1.local:05821] hwloc_distances->latency[6]=2.100000 > [node1.local:05821] hwloc_distances->latency[7]=2.100000 > [node1.local:05821] ibv_obj->type = 4 > [node1.local:05821] ibv_obj->logical_index=1 > [node1.local:05821] my_obj->logical_index=0 > [node1.local:05821] Proc is bound: distance=2.100000 > > [node1.local:05821] Checking distance for device=mlx4_0 > [node1.local:05821] hwloc_distances->nbobjs=4 > [node1.local:05821] hwloc_distances->latency[0]=1.000000 > [node1.local:05821] hwloc_distances->latency[1]=2.100000 > [node1.local:05821] hwloc_distances->latency[2]=2.100000 > [node1.local:05821] hwloc_distances->latency[3]=2.100000 > [node1.local:05821] hwloc_distances->latency[4]=2.100000 > [node1.local:05821] hwloc_distances->latency[5]=1.000000 > [node1.local:05821] hwloc_distances->latency[6]=2.100000 > [node1.local:05821] hwloc_distances->latency[7]=2.100000 > [node1.local:05821] ibv_obj->type = 1 > <---------------------HWLOC_OBJ_MACHINE > [node1.local:05821] ibv_obj->type set to NULL > [node1.local:05821] Proc is bound: distance=0.000000 > > [node1.local:05821] [rank=0] openib: skipping device mlx4_1; it is too > far away > [node1.local:05821] [rank=0] openib: using port mlx4_0:1 > [node1.local:05821] [rank=0] openib: using port mlx4_0:2 > > > Machine (1024GB) > NUMANode L#0 (P#0 256GB) + Socket L#0 + L3 L#0 (30MB) > L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU > L#0 (P#0) > L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU > L#1 (P#1) > L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU > L#2 (P#2) > L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU > L#3 (P#3) > L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU > L#4 (P#4) > L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU > L#5 (P#5) > L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU > L#6 (P#6) > L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU > L#7 (P#7) > L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU > L#8 (P#8) > L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU > L#9 (P#9) > L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + > PU L#10 (P#10) > L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + > PU L#11 (P#11) > NUMANode L#1 (P#1 256GB) > Socket L#1 + L3 L#1 (30MB) > L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 > + PU L#12 (P#12) > L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 > + PU L#13 (P#13) > L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 > + PU L#14 (P#14) > L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 > + PU L#15 (P#15) > L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 > + PU L#16 (P#16) > L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 > + PU L#17 (P#17) > L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 > + PU L#18 (P#18) > L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 > + PU L#19 (P#19) > L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 > + PU L#20 (P#20) > L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 > + PU L#21 (P#21) > L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 > + PU L#22 (P#22) > L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 > + PU L#23 (P#23) > HostBridge L#5 > PCIBridge > PCI 15b3:1003 > Net L#7 "ib2" > Net L#8 "ib3" > OpenFabrics L#9 "mlx4_1" > > NUMANode L#2 (P#2 256GB) + Socket L#2 + L3 L#2 (30MB) > L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + > PU L#24 (P#24) > L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + > PU L#25 (P#25) > L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + > PU L#26 (P#26) > L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + > PU L#27 (P#27) > L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + > PU L#28 (P#28) > L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + > PU L#29 (P#29) > L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + > PU L#30 (P#30) > L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + > PU L#31 (P#31) > L2 L#32 (256KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + > PU L#32 (P#32) > L2 L#33 (256KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + > PU L#33 (P#33) > L2 L#34 (256KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + > PU L#34 (P#34) > L2 L#35 (256KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + > PU L#35 (P#35) > NUMANode L#3 (P#3 256GB) + Socket L#3 + L3 L#3 (30MB) > L2 L#36 (256KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + > PU L#36 (P#36) > L2 L#37 (256KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + > PU L#37 (P#37) > L2 L#38 (256KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + > PU L#38 (P#38) > L2 L#39 (256KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + > PU L#39 (P#39) > L2 L#40 (256KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + > PU L#40 (P#40) > L2 L#41 (256KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + > PU L#41 (P#41) > L2 L#42 (256KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + > PU L#42 (P#42) > L2 L#43 (256KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + > PU L#43 (P#43) > L2 L#44 (256KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + > PU L#44 (P#44) > L2 L#45 (256KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + > PU L#45 (P#45) > L2 L#46 (256KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + > PU L#46 (P#46) > L2 L#47 (256KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + > PU L#47 (P#47) > HostBridge L#0 > PCIBridge > PCI 8086:1528 > Net L#0 "eth0" > PCI 8086:1528 > Net L#1 "eth1" > PCIBridge > PCI 1000:005d > Block L#2 "sda" > PCIBridge > PCI 15b3:1003 > Net L#3 "ib0" > Net L#4 "ib1" > OpenFabrics L#5 "mlx4_0" > PCIBridge > PCI 102b:0522 > PCI 19a2:0800 > PCI 8086:1d02 > Block L#6 "sr0" >