Brice,

as a side note, what is the rationale for defining the distance as a floating-point number?

I remember I had to fix a bug in OMPI a while ago because of this:
/* e.g. replace if (d1 == d2) with if (fabs(d1 - d2) < epsilon) */
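
To illustrate (a minimal sketch; the helper name and the epsilon value are mine, not from the actual fix):

#include <math.h>

/* Illustrative tolerance for comparing hwloc distances. */
#define DIST_EPSILON 1e-6

/* The fabs() matters: (d1 - d2) < epsilon alone is also true whenever
 * d1 is much smaller than d2, so it is not an equality test. */
static int distances_equal(double d1, double d2)
{
    return fabs(d1 - d2) < DIST_EPSILON;
}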

Cheers,

Gilles

On 9/1/2015 5:28 AM, Brice Goglin wrote:
The locality of mlx4_0 as reported by lstopo is "near the entire machine" (while mlx4_1 is reported near NUMA node #3). I would vote for buggy PCI-NUMA affinity being reported by the BIOS. But I am not very familiar with 4x E5-4600 machines, so please make sure this PCI slot is really attached to a single NUMA node (some older 4-socket machines have an I/O hub attached to two sockets).

Given the lspci output, mlx4_0 is likely on the PCI bus attached to NUMA node #0, so you should be able to work around the issue by setting HWLOC_PCI_0000_00_LOCALCPUS=0xfff in the environment.
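
If it helps to verify, here is a minimal sketch of checking what locality hwloc ends up reporting for each PCI device once that variable is set (hwloc 1.x API; the setenv() call stands in for exporting the variable before launch, and must run before the topology is loaded):

#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_obj_t obj;
    char cpuset[128];

    /* Stand-in for exporting the variable in the launch environment;
     * it must be set before hwloc discovers the topology. */
    setenv("HWLOC_PCI_0000_00_LOCALCPUS", "0xfff", 1);

    hwloc_topology_init(&topo);
    /* Enable I/O discovery so PCI and OpenFabrics objects show up. */
    hwloc_topology_set_flags(topo, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
    hwloc_topology_load(topo);

    /* Print the cpuset of each PCI device's first non-I/O ancestor;
     * with correct locality this should be a NUMA node, not the
     * whole machine. */
    for (obj = hwloc_get_next_pcidev(topo, NULL); obj;
         obj = hwloc_get_next_pcidev(topo, obj)) {
        hwloc_obj_t anc = hwloc_get_non_io_ancestor_obj(topo, obj);
        hwloc_bitmap_snprintf(cpuset, sizeof(cpuset), anc->cpuset);
        printf("%04x:%02x:%02x.%x local cpuset: %s\n",
               obj->attr->pcidev.domain, obj->attr->pcidev.bus,
               obj->attr->pcidev.dev, obj->attr->pcidev.func, cpuset);
    }

    hwloc_topology_destroy(topo);
    return 0;
}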

There are 8 host bridges in this machine, 2 attached to each processor, so there are likely similar issues for the others.

Brice



On 31/08/2015 22:06, Rolf vandeVaart wrote:

There was a problem reported on the users list about Open MPI always picking the same Mellanox card when there were two in the machine.


http://www.open-mpi.org/community/lists/users/2015/08/27507.php


We dug a little deeper and I think this has to do with how hwloc is figuring out where one of the cards is located. This verbose output (with some extra printfs) shows that it cannot figure out which NUMA node mlx4_0 is closest to. It can only determine that the device is located on the root machine object (HWLOC_OBJ_MACHINE), and therefore Open MPI assumes a distance of 0.0. Because of this (smaller is better), the Open MPI library always picks mlx4_0 for all sockets. I am trying to figure out whether this is a hwloc bug or an Open MPI bug. Any thoughts on this?
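
For context, this is roughly the logic in question (a simplified, hypothetical sketch of the distance computation against the hwloc 1.x API, not the verbatim Open MPI code):

#include <hwloc.h>

/* my_obj is the NUMA node the process is bound to. Returns the
 * latency between it and the NUMA node above the ibv device, or
 * 0.0 when the device's only non-I/O ancestor is the root machine
 * object -- the failing case in the output below. */
static float device_distance(hwloc_topology_t topo,
                             hwloc_obj_t ibv_obj, hwloc_obj_t my_obj)
{
    const struct hwloc_distances_s *dist =
        hwloc_get_whole_distance_matrix_by_type(topo, HWLOC_OBJ_NODE);

    if (!dist || !dist->latency)
        return 0.0f;

    /* Climb toward the root until we find a NUMA node; hitting the
     * machine object first means the BIOS exposed no usable locality
     * for this device. */
    while (ibv_obj && ibv_obj->type != HWLOC_OBJ_NODE) {
        if (ibv_obj->type == HWLOC_OBJ_MACHINE)
            return 0.0f;   /* distance 0.0 => always "closest" */
        ibv_obj = ibv_obj->parent;
    }
    if (!ibv_obj)
        return 0.0f;

    return dist->latency[my_obj->logical_index * dist->nbobjs
                         + ibv_obj->logical_index];
}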


[node1.local:05821] Checking distance for device=mlx4_1
[node1.local:05821] hwloc_distances->nbobjs=4
[node1.local:05821] hwloc_distances->latency[0]=1.000000
[node1.local:05821] hwloc_distances->latency[1]=2.100000
[node1.local:05821] hwloc_distances->latency[2]=2.100000
[node1.local:05821] hwloc_distances->latency[3]=2.100000
[node1.local:05821] hwloc_distances->latency[4]=2.100000
[node1.local:05821] hwloc_distances->latency[5]=1.000000
[node1.local:05821] hwloc_distances->latency[6]=2.100000
[node1.local:05821] hwloc_distances->latency[7]=2.100000
[node1.local:05821] ibv_obj->type = 4
[node1.local:05821] ibv_obj->logical_index=1
[node1.local:05821] my_obj->logical_index=0
[node1.local:05821] Proc is bound: distance=2.100000

[node1.local:05821] Checking distance for device=mlx4_0
[node1.local:05821] hwloc_distances->nbobjs=4
[node1.local:05821] hwloc_distances->latency[0]=1.000000
[node1.local:05821] hwloc_distances->latency[1]=2.100000
[node1.local:05821] hwloc_distances->latency[2]=2.100000
[node1.local:05821] hwloc_distances->latency[3]=2.100000
[node1.local:05821] hwloc_distances->latency[4]=2.100000
[node1.local:05821] hwloc_distances->latency[5]=1.000000
[node1.local:05821] hwloc_distances->latency[6]=2.100000
[node1.local:05821] hwloc_distances->latency[7]=2.100000
[node1.local:05821] ibv_obj->type = 1 <---------------------HWLOC_OBJ_MACHINE
[node1.local:05821] ibv_obj->type set to NULL
[node1.local:05821] Proc is bound: distance=0.000000

[node1.local:05821] [rank=0] openib: skipping device mlx4_1; it is too far away
[node1.local:05821] [rank=0] openib: using port mlx4_0:1
[node1.local:05821] [rank=0] openib: using port mlx4_0:2


Machine (1024GB)
  NUMANode L#0 (P#0 256GB) + Socket L#0 + L3 L#0 (30MB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
    L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
  NUMANode L#1 (P#1 256GB)
    Socket L#1 + L3 L#1 (30MB)
      L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
    HostBridge L#5
      PCIBridge
        PCI 15b3:1003
          Net L#7 "ib2"
          Net L#8 "ib3"
          OpenFabrics L#9 "mlx4_1"

  NUMANode L#2 (P#2 256GB) + Socket L#2 + L3 L#2 (30MB)
    L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
    L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
    L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
    L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
    L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
    L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
    L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
    L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
    L2 L#32 (256KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
    L2 L#33 (256KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
    L2 L#34 (256KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
    L2 L#35 (256KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
  NUMANode L#3 (P#3 256GB) + Socket L#3 + L3 L#3 (30MB)
    L2 L#36 (256KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
    L2 L#37 (256KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
    L2 L#38 (256KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
    L2 L#39 (256KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
    L2 L#40 (256KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
    L2 L#41 (256KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
    L2 L#42 (256KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
    L2 L#43 (256KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
    L2 L#44 (256KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
    L2 L#45 (256KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
    L2 L#46 (256KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
    L2 L#47 (256KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
  HostBridge L#0
    PCIBridge
      PCI 8086:1528
        Net L#0 "eth0"
      PCI 8086:1528
        Net L#1 "eth1"
    PCIBridge
      PCI 1000:005d
        Block L#2 "sda"
    PCIBridge
      PCI 15b3:1003
        Net L#3 "ib0"
        Net L#4 "ib1"
        OpenFabrics L#5 "mlx4_0"
    PCIBridge
      PCI 102b:0522
      PCI 19a2:0800
    PCI 8086:1d02
      Block L#6 "sr0"



