What is the output of /sbin/lspci -tv?
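
Also, it might help to see what hwloc itself reports as the locality of
each HCA, independent of the openib selection code.  Below is a minimal
hwloc 1.x sketch (untested, and the printf formatting is mine, not
Open MPI's) that prints the first non-I/O ancestor of every OpenFabrics
OS device; on your machine it should show what mlx4_0 ends up attached to:

#include <stdio.h>
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topology;
    hwloc_obj_t osdev, ancestor;

    hwloc_topology_init(&topology);
    /* I/O objects (bridges, OS devices) are not discovered by default */
    hwloc_topology_set_flags(topology, HWLOC_TOPOLOGY_FLAG_IO_DEVICES);
    hwloc_topology_load(topology);

    /* walk the OS devices and print the first non-I/O ancestor of each
       OpenFabrics device (mlx4_0, mlx4_1, ...) */
    for (osdev = hwloc_get_next_osdev(topology, NULL); osdev != NULL;
         osdev = hwloc_get_next_osdev(topology, osdev)) {
        if (osdev->attr->osdev.type != HWLOC_OBJ_OSDEV_OPENFABRICS)
            continue;
        ancestor = hwloc_get_non_io_ancestor_obj(topology, osdev);
        printf("%s: ancestor type=%s logical_index=%u\n", osdev->name,
               ancestor ? hwloc_obj_type_string(ancestor->type) : "(none)",
               ancestor ? ancestor->logical_index : 0u);
    }

    hwloc_topology_destroy(topology);
    return 0;
}

Build with, e.g., "cc hca_loc.c $(pkg-config --cflags --libs hwloc)"
(hca_loc.c is just a placeholder name).  Note that in the lstopo output
you pasted, the HostBridge containing mlx4_1 sits under NUMANode L#1,
while HostBridge L#0, which contains mlx4_0, hangs directly off the
Machine object; that is consistent with your printfs.  If the sketch
prints Machine (or nothing) for mlx4_0, the problem is on the hwloc
discovery side rather than in Open MPI's selection logic.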

On Aug 31, 2015, at 4:06 PM, Rolf vandeVaart <rvandeva...@nvidia.com> wrote:

> There was a problem reported on the users list about Open MPI always picking 
> one Mellanox card when there were two in the machine.
> 
> http://www.open-mpi.org/community/lists/users/2015/08/27507.php
> 
> We dug a little deeper, and I think this has to do with how hwloc is 
> figuring out where one of the cards is located.  The verbose output below 
> (with some extra printfs) shows that it cannot figure out which NUMA node 
> mlx4_0 is closest to.  It can only determine that the device is located 
> under HWLOC_OBJ_MACHINE (ibv_obj->type = 1), and in that case Open MPI 
> assumes a distance of 0.0.  Because smaller is better, the Open MPI library 
> then always picks mlx4_0 for all sockets.  I am trying to figure out 
> whether this is a hwloc bug or an Open MPI bug.  Any thoughts on this?
> 
> [node1.local:05821] Checking distance for device=mlx4_1
> [node1.local:05821] hwloc_distances->nbobjs=4
> [node1.local:05821] hwloc_distances->latency[0]=1.000000
> [node1.local:05821] hwloc_distances->latency[1]=2.100000
> [node1.local:05821] hwloc_distances->latency[2]=2.100000
> [node1.local:05821] hwloc_distances->latency[3]=2.100000
> [node1.local:05821] hwloc_distances->latency[4]=2.100000
> [node1.local:05821] hwloc_distances->latency[5]=1.000000
> [node1.local:05821] hwloc_distances->latency[6]=2.100000
> [node1.local:05821] hwloc_distances->latency[7]=2.100000
> [node1.local:05821] ibv_obj->type = 4
> [node1.local:05821] ibv_obj->logical_index=1
> [node1.local:05821] my_obj->logical_index=0
> [node1.local:05821] Proc is bound: distance=2.100000
> 
> [node1.local:05821] Checking distance for device=mlx4_0
> [node1.local:05821] hwloc_distances->nbobjs=4
> [node1.local:05821] hwloc_distances->latency[0]=1.000000
> [node1.local:05821] hwloc_distances->latency[1]=2.100000
> [node1.local:05821] hwloc_distances->latency[2]=2.100000
> [node1.local:05821] hwloc_distances->latency[3]=2.100000
> [node1.local:05821] hwloc_distances->latency[4]=2.100000
> [node1.local:05821] hwloc_distances->latency[5]=1.000000
> [node1.local:05821] hwloc_distances->latency[6]=2.100000
> [node1.local:05821] hwloc_distances->latency[7]=2.100000
> [node1.local:05821] ibv_obj->type = 1 <---------------------HWLOC_OBJ_MACHINE
> [node1.local:05821] ibv_obj->type set to NULL
> [node1.local:05821] Proc is bound: distance=0.000000
> [node1.local:05821] [rank=0] openib: skipping device mlx4_1; it is too far away
> [node1.local:05821] [rank=0] openib: using port mlx4_0:1
> [node1.local:05821] [rank=0] openib: using port mlx4_0:2
> 
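(A note on reading those numbers: in hwloc 1.x, struct hwloc_distances_s
stores "latency" as a row-major nbobjs x nbobjs matrix, so the lookup
behind the "Proc is bound: distance=..." line would be, roughly, the
sketch below (pair_latency is a made-up name, not an actual Open MPI
function):

/* latency is a row-major nbobjs x nbobjs matrix of floats */
static float pair_latency(const struct hwloc_distances_s *d,
                          unsigned my_index, unsigned ibv_index)
{
    return d->latency[my_index * d->nbobjs + ibv_index];
}

With nbobjs=4, my_obj->logical_index=0, and ibv_obj->logical_index=1,
that is latency[0*4+1] = 2.100000, matching the mlx4_1 result above.
For mlx4_0 there is no NUMA-node index to look up at all, which is
presumably why the code falls back to the 0.000000 shown above.)
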
> Machine (1024GB)
>   NUMANode L#0 (P#0 256GB) + Socket L#0 + L3 L#0 (30MB)
>     L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
>     L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
>     L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
>     L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
>     L2 L#4 (256KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
>     L2 L#5 (256KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
>     L2 L#6 (256KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
>     L2 L#7 (256KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
>     L2 L#8 (256KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
>     L2 L#9 (256KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
>     L2 L#10 (256KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
>     L2 L#11 (256KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
>   NUMANode L#1 (P#1 256GB)
>     Socket L#1 + L3 L#1 (30MB)
>       L2 L#12 (256KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
>       L2 L#13 (256KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
>       L2 L#14 (256KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
>       L2 L#15 (256KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
>       L2 L#16 (256KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
>       L2 L#17 (256KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
>       L2 L#18 (256KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
>       L2 L#19 (256KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
>       L2 L#20 (256KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
>       L2 L#21 (256KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
>       L2 L#22 (256KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
>       L2 L#23 (256KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
>     HostBridge L#5
>       PCIBridge
>         PCI 15b3:1003
>           Net L#7 "ib2"
>           Net L#8 "ib3"
>           OpenFabrics L#9 "mlx4_1"
>   NUMANode L#2 (P#2 256GB) + Socket L#2 + L3 L#2 (30MB)
>     L2 L#24 (256KB) + L1d L#24 (32KB) + L1i L#24 (32KB) + Core L#24 + PU L#24 (P#24)
>     L2 L#25 (256KB) + L1d L#25 (32KB) + L1i L#25 (32KB) + Core L#25 + PU L#25 (P#25)
>     L2 L#26 (256KB) + L1d L#26 (32KB) + L1i L#26 (32KB) + Core L#26 + PU L#26 (P#26)
>     L2 L#27 (256KB) + L1d L#27 (32KB) + L1i L#27 (32KB) + Core L#27 + PU L#27 (P#27)
>     L2 L#28 (256KB) + L1d L#28 (32KB) + L1i L#28 (32KB) + Core L#28 + PU L#28 (P#28)
>     L2 L#29 (256KB) + L1d L#29 (32KB) + L1i L#29 (32KB) + Core L#29 + PU L#29 (P#29)
>     L2 L#30 (256KB) + L1d L#30 (32KB) + L1i L#30 (32KB) + Core L#30 + PU L#30 (P#30)
>     L2 L#31 (256KB) + L1d L#31 (32KB) + L1i L#31 (32KB) + Core L#31 + PU L#31 (P#31)
>     L2 L#32 (256KB) + L1d L#32 (32KB) + L1i L#32 (32KB) + Core L#32 + PU L#32 (P#32)
>     L2 L#33 (256KB) + L1d L#33 (32KB) + L1i L#33 (32KB) + Core L#33 + PU L#33 (P#33)
>     L2 L#34 (256KB) + L1d L#34 (32KB) + L1i L#34 (32KB) + Core L#34 + PU L#34 (P#34)
>     L2 L#35 (256KB) + L1d L#35 (32KB) + L1i L#35 (32KB) + Core L#35 + PU L#35 (P#35)
>   NUMANode L#3 (P#3 256GB) + Socket L#3 + L3 L#3 (30MB)
>     L2 L#36 (256KB) + L1d L#36 (32KB) + L1i L#36 (32KB) + Core L#36 + PU L#36 (P#36)
>     L2 L#37 (256KB) + L1d L#37 (32KB) + L1i L#37 (32KB) + Core L#37 + PU L#37 (P#37)
>     L2 L#38 (256KB) + L1d L#38 (32KB) + L1i L#38 (32KB) + Core L#38 + PU L#38 (P#38)
>     L2 L#39 (256KB) + L1d L#39 (32KB) + L1i L#39 (32KB) + Core L#39 + PU L#39 (P#39)
>     L2 L#40 (256KB) + L1d L#40 (32KB) + L1i L#40 (32KB) + Core L#40 + PU L#40 (P#40)
>     L2 L#41 (256KB) + L1d L#41 (32KB) + L1i L#41 (32KB) + Core L#41 + PU L#41 (P#41)
>     L2 L#42 (256KB) + L1d L#42 (32KB) + L1i L#42 (32KB) + Core L#42 + PU L#42 (P#42)
>     L2 L#43 (256KB) + L1d L#43 (32KB) + L1i L#43 (32KB) + Core L#43 + PU L#43 (P#43)
>     L2 L#44 (256KB) + L1d L#44 (32KB) + L1i L#44 (32KB) + Core L#44 + PU L#44 (P#44)
>     L2 L#45 (256KB) + L1d L#45 (32KB) + L1i L#45 (32KB) + Core L#45 + PU L#45 (P#45)
>     L2 L#46 (256KB) + L1d L#46 (32KB) + L1i L#46 (32KB) + Core L#46 + PU L#46 (P#46)
>     L2 L#47 (256KB) + L1d L#47 (32KB) + L1i L#47 (32KB) + Core L#47 + PU L#47 (P#47)
>   HostBridge L#0
>     PCIBridge
>       PCI 8086:1528
>         Net L#0 "eth0"
>       PCI 8086:1528
>         Net L#1 "eth1"
>     PCIBridge
>       PCI 1000:005d
>         Block L#2 "sda"
>     PCIBridge
>       PCI 15b3:1003
>         Net L#3 "ib0"
>         Net L#4 "ib1"
>         OpenFabrics L#5 "mlx4_0"
>     PCIBridge
>       PCI 102b:0522
>       PCI 19a2:0800
>     PCI 8086:1d02
>       Block L#6 "sr0"
> 