Hello

It's a float because we normalize to 1 on the diagonal (some AMD
machines have values like 10 on the diagonal and 16 or 22 otherwise, so
you get 1.0, 1.6 or 2.2 after normalization), and also because some users
wanted to specify their own distance matrix.
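
To make that concrete, here is a minimal sketch (not hwloc's actual
code, and the SLIT values below are made up) of how raw BIOS-reported
distances such as 10/16/22 end up as the floats 1.0/1.6/2.2:

#include <stdio.h>

#define N 4

int main(void)
{
    /* hypothetical SLIT-style matrix: 10 on the diagonal (local),
     * 16 or 22 otherwise (remote) */
    unsigned raw[N][N] = {
        { 10, 16, 22, 22 },
        { 16, 10, 22, 22 },
        { 22, 22, 10, 16 },
        { 22, 22, 16, 10 },
    };
    float norm[N][N];
    unsigned i, j;

    /* divide each entry by the diagonal so local access becomes 1.0 */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            norm[i][j] = (float) raw[i][j] / (float) raw[i][i];

    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++)
            printf("%.1f ", norm[i][j]);
        printf("\n");
    }
    return 0;
}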

I'd like to clean up the distance API in hwloc 2.0. Current ideas are:
1) Removing normalization and floats? Should be possible.
2) Only supporting distance matrices that cover the entire machine?
Likely fine too.
3) Removing the ability for users to specify distances manually? It's
useful for adding locality based on benchmarks when the BIOS/kernel
doesn't report enough, so I need to talk with users first.
4) Only supporting NUMA distances. Depends on (3).
Comments are welcome.
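
For context, the kind of consumer code affected by (1) and (2) looks
roughly like the untested sketch below (hwloc 1.x API, arbitrary
epsilon); it also shows why comparing these float distances needs a
tolerance, as Gilles notes below:

#include <hwloc.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topology;
    const struct hwloc_distances_s *dist;
    unsigned i, j;
    const float eps = 1e-3f;   /* arbitrary tolerance for this sketch */

    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    /* whole-machine NUMA latency matrix, if the BIOS/kernel reports one */
    dist = hwloc_get_whole_distance_matrix_by_type(topology, HWLOC_OBJ_NODE);
    if (dist) {
        for (i = 0; i < dist->nbobjs; i++)
            for (j = 0; j < dist->nbobjs; j++) {
                float d = dist->latency[i * dist->nbobjs + j];
                /* normalized values are floats, so compare with a
                 * tolerance rather than == */
                if (fabsf(d - 1.0f) < eps)
                    printf("NUMA nodes %u and %u are local\n", i, j);
            }
    }

    hwloc_topology_destroy(topology);
    return 0;
}

(Compile with -lhwloc -lm.)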

Brice



On 01/09/2015 01:50, Gilles Gouaillardet wrote:
> Brice,
>
> as a side note, what is the rationale for defining the distance as a
> floating-point number?
>
> I remember I had to fix a bug in ompi a while ago
> /* e.g. replace if (d1 == d2) with if (fabs(d1 - d2) < epsilon) */
>
> Cheers,
>
> Gilles
>
> On 9/1/2015 5:28 AM, Brice Goglin wrote:
>> The locality of mlx4_0 as reported by lstopo is "near the entire
>> machine" (while mlx4_1 is reported near NUMA node #3). I would vote
>> for buggy PCI-NUMA affinity being reported by the BIOS. But I am not
>> very familiar with 4x E5-4600 machines, so please make sure this PCI
>> slot is really attached to a single NUMA node (some older 4-socket
>> machines have an I/O hub attached to 2 sockets).
>>
>> Given the lspci output, mlx4_0 is likely on the PCI bus attached to
>> NUMA node #0, so you should be able to work around the issue by
>> setting HWLOC_PCI_0000_00_LOCALCPUS=0xfff in the environment.
>>
>> There are 8 hostbridges in this machine, 2 attached to each
>> processor, so there are likely similar issues for the others.
>>
>> Brice
>>
>>
>>
>> On 31/08/2015 22:06, Rolf vandeVaart wrote:
>>>
>>> There was a problem reported on the users list about Open MPI
>>> always picking one Mellanox card when there were two in the machine.
>>>
>>>
>>> http://www.open-mpi.org/community/lists/users/2015/08/27507.php
>>>
>>>
>>> We dug a little deeper and I think this has to do with how hwloc is
>>> figuring out where one of the cards is located.  This verbose output
>>> (with some extra printfs) shows that it cannot figure out which NUMA
>>> node mlx4_0 is closest to. It can only determine that it is located
>>> on HWLOC_OBJ_SYSTEM, and therefore Open MPI assumes a distance of
>>> 0.0. Because of this (smaller is better), the Open MPI library
>>> always picks mlx4_0 for all sockets.  I am trying to figure out
>>> whether this is a hwloc or an Open MPI bug. Any thoughts on this?
>>>
>>>
>>> [node1.local:05821] Checking distance for device=mlx4_1
>>> [node1.local:05821] hwloc_distances->nbobjs=4
>>> [node1.local:05821] hwloc_distances->latency[0]=1.000000
>>> [node1.local:05821] hwloc_distances->latency[1]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[2]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[3]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[4]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[5]=1.000000
>>> [node1.local:05821] hwloc_distances->latency[6]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[7]=2.100000
>>> [node1.local:05821] ibv_obj->type = 4
>>> [node1.local:05821] ibv_obj->logical_index=1
>>> [node1.local:05821] my_obj->logical_index=0
>>> [node1.local:05821] Proc is bound: distance=2.100000
>>>
>>> [node1.local:05821] Checking distance for device=mlx4_0
>>> [node1.local:05821] hwloc_distances->nbobjs=4
>>> [node1.local:05821] hwloc_distances->latency[0]=1.000000
>>> [node1.local:05821] hwloc_distances->latency[1]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[2]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[3]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[4]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[5]=1.000000
>>> [node1.local:05821] hwloc_distances->latency[6]=2.100000
>>> [node1.local:05821] hwloc_distances->latency[7]=2.100000
>>> [node1.local:05821] ibv_obj->type = 1 <---------------------HWLOC_OBJ_MACHINE
>>> [node1.local:05821] ibv_obj->type set to NULL
>>> [node1.local:05821] Proc is bound: distance=0.000000
>>>
>>> [node1.local:05821] [rank=0] openib: skipping device mlx4_1; it is too far away
>>> [node1.local:05821] [rank=0] openib: using port mlx4_0:1
>>> [node1.local:05821] [rank=0] openib: using port mlx4_0:2
>>>
>>>
>>> Machine (1024GB)
>>>   NUMANode L#0 (P#0 256GB) + Socket L#0 + L3 L#0 (30MB)
>>>     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>>>     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>>>     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>>>     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>>>     L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>>>     L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>>>     L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>>>     L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>>>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>>>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>>>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>>>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>>>   NUMANode L#1 (P#1 256GB)
>>>     Socket L#1 + L3 L#1 (30MB)
>>>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
>>>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
>>>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
>>>       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
>>>       L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
>>>       L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
>>>       L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
>>>       L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
>>>       L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
>>>       L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
>>>       L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
>>>       L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
>>>     HostBridge L#5
>>>       PCIBridge
>>>         PCI 15b3:1003
>>>           Net L#7 "ib2"
>>>           Net L#8 "ib3"
>>>           OpenFabrics L#9 "mlx4_1"
>>>
>>>   NUMANode L#2 (P#2 256GB) + Socket L#2 + L3 L#2 (30MB)
>>>     L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
>>>     L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
>>>     L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
>>>     L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
>>>     L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
>>>     L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
>>>     L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
>>>     L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
>>>     L2 L#32 (256KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
>>>     L2 L#33 (256KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
>>>     L2 L#34 (256KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
>>>     L2 L#35 (256KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
>>>   NUMANode L#3 (P#3 256GB) + Socket L#3 + L3 L#3 (30MB)
>>>     L2 L#36 (256KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
>>>     L2 L#37 (256KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
>>>     L2 L#38 (256KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
>>>     L2 L#39 (256KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
>>>     L2 L#40 (256KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
>>>     L2 L#41 (256KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
>>>     L2 L#42 (256KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
>>>     L2 L#43 (256KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
>>>     L2 L#44 (256KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
>>>     L2 L#45 (256KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
>>>     L2 L#46 (256KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
>>>     L2 L#47 (256KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
>>>   HostBridge L#0
>>>     PCIBridge
>>>       PCI 8086:1528
>>>         Net L#0 "eth0"
>>>       PCI 8086:1528
>>>         Net L#1 "eth1"
>>>     PCIBridge
>>>       PCI 1000:005d
>>>         Block L#2 "sda"
>>>     PCIBridge
>>>       PCI 15b3:1003
>>>         Net L#3 "ib0"
>>>         Net L#4 "ib1"
>>>         OpenFabrics L#5 "mlx4_0"
>>>     PCIBridge
>>>       PCI 102b:0522
>>>       PCI 19a2:0800
>>>     PCI 8086:1d02
>>>       Block L#6 "sr0"
>>>
>>
>
