A problem was reported on the users list about Open MPI always picking 
one Mellanox card when there were two in the machine.


http://www.open-mpi.org/community/lists/users/2015/08/27507.php


We dug a little deeper and I think this has to do with how hwloc is figuring 
out where one of the cards is located.  This verbose output (with some extra 
printfs) shows that it cannot figure out which NUMA node mlx4_0 is closest to. 
It can only determine that the card is located under HWLOC_OBJ_MACHINE, and 
therefore Open MPI assumes a distance of 0.0.  Because smaller is better, Open 
MPI always picks mlx4_0 for all sockets.  I am trying to figure out if this 
is a hwloc or Open MPI bug. Any thoughts on this?
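
To make the effect concrete, here is a rough Python model of the selection as 
I understand it. It is only a sketch, not the actual openib code: the function 
names and the device table are made up for illustration, and the latency 
matrix is assumed to be 1.0 on the diagonal and 2.1 everywhere else, which 
matches the entries printed in the log below.

```python
# Simplified model of the behavior described above; NOT Open MPI's real code.
# All names here (LATENCY, DEVICE_NODE, device_distance, pick_devices) are
# hypothetical and exist only for this illustration.

NUM_NODES = 4

# Assumed NUMA latency matrix: 1.0 to the local node, 2.1 to any remote node.
LATENCY = [[1.0 if i == j else 2.1 for j in range(NUM_NODES)]
           for i in range(NUM_NODES)]

# NUMA node each device hangs off. None means the device could only be
# placed under the Machine object, so its NUMA locality is unknown.
DEVICE_NODE = {"mlx4_0": None, "mlx4_1": 1}


def device_distance(proc_node, dev_node):
    """Distance from a process bound to proc_node to a device."""
    if dev_node is None:
        # Unknown locality is treated as distance 0.0, as seen in the log.
        return 0.0
    return LATENCY[proc_node][dev_node]


def pick_devices(proc_node):
    """Keep only the devices at the minimum distance (smaller is better)."""
    dist = {dev: device_distance(proc_node, node)
            for dev, node in DEVICE_NODE.items()}
    best = min(dist.values())
    return sorted(dev for dev, d in dist.items() if d == best)


if __name__ == "__main__":
    for node in range(NUM_NODES):
        print(f"proc bound to NUMA node {node}: {pick_devices(node)}")
```

In this model mlx4_0 wins for every binding, including a process bound to 
NUMA node 1 where mlx4_1 is actually local, because 0.0 beats even the local 
latency of 1.0.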


[node1.local:05821] Checking distance for device=mlx4_1
[node1.local:05821] hwloc_distances->nbobjs=4
[node1.local:05821] hwloc_distances->latency[0]=1.000000
[node1.local:05821] hwloc_distances->latency[1]=2.100000
[node1.local:05821] hwloc_distances->latency[2]=2.100000
[node1.local:05821] hwloc_distances->latency[3]=2.100000
[node1.local:05821] hwloc_distances->latency[4]=2.100000
[node1.local:05821] hwloc_distances->latency[5]=1.000000
[node1.local:05821] hwloc_distances->latency[6]=2.100000
[node1.local:05821] hwloc_distances->latency[7]=2.100000
[node1.local:05821] ibv_obj->type = 4
[node1.local:05821] ibv_obj->logical_index=1
[node1.local:05821] my_obj->logical_index=0
[node1.local:05821] Proc is bound: distance=2.100000

[node1.local:05821] Checking distance for device=mlx4_0
[node1.local:05821] hwloc_distances->nbobjs=4
[node1.local:05821] hwloc_distances->latency[0]=1.000000
[node1.local:05821] hwloc_distances->latency[1]=2.100000
[node1.local:05821] hwloc_distances->latency[2]=2.100000
[node1.local:05821] hwloc_distances->latency[3]=2.100000
[node1.local:05821] hwloc_distances->latency[4]=2.100000
[node1.local:05821] hwloc_distances->latency[5]=1.000000
[node1.local:05821] hwloc_distances->latency[6]=2.100000
[node1.local:05821] hwloc_distances->latency[7]=2.100000
[node1.local:05821] ibv_obj->type = 1 <---------------------HWLOC_OBJ_MACHINE
[node1.local:05821] ibv_obj->type set to NULL
[node1.local:05821] Proc is bound: distance=0.000000

[node1.local:05821] [rank=0] openib: skipping device mlx4_1; it is too far away
[node1.local:05821] [rank=0] openib: using port mlx4_0:1
[node1.local:05821] [rank=0] openib: using port mlx4_0:2


Machine (1024GB)
  NUMANode L#0 (P#0 256GB) + Socket L#0 + L3 L#0 (30MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
  NUMANode L#1 (P#1 256GB)
    Socket L#1 + L3 L#1 (30MB)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
    HostBridge L#5
      PCIBridge
        PCI 15b3:1003
          Net L#7 "ib2"
          Net L#8 "ib3"
          OpenFabrics L#9 "mlx4_1"

  NUMANode L#2 (P#2 256GB) + Socket L#2 + L3 L#2 (30MB)
    L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
    L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
    L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
    L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
    L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
    L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
    L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
    L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
    L2 L#32 (256KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
    L2 L#33 (256KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
    L2 L#34 (256KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
    L2 L#35 (256KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
  NUMANode L#3 (P#3 256GB) + Socket L#3 + L3 L#3 (30MB)
    L2 L#36 (256KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
    L2 L#37 (256KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
    L2 L#38 (256KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
    L2 L#39 (256KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
    L2 L#40 (256KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
    L2 L#41 (256KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
    L2 L#42 (256KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
    L2 L#43 (256KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
    L2 L#44 (256KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
    L2 L#45 (256KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
    L2 L#46 (256KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
    L2 L#47 (256KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
  HostBridge L#0
    PCIBridge
      PCI 8086:1528
        Net L#0 "eth0"
      PCI 8086:1528
        Net L#1 "eth1"
    PCIBridge
      PCI 1000:005d
        Block L#2 "sda"
    PCIBridge
      PCI 15b3:1003
        Net L#3 "ib0"
        Net L#4 "ib1"
        OpenFabrics L#5 "mlx4_0"
    PCIBridge
      PCI 102b:0522
      PCI 19a2:0800
    PCI 8086:1d02
      Block L#6 "sr0"


