I have a 4-socket machine with two dual-port InfiniBand cards (devices
mlx4_0 and mlx4_1). The cards are connected to PCI slots belonging to
different CPUs (I hope..), both ports are active on both cards, and
everything is connected to the same physical network.
I use openmpi-1.10.0 and run the IMB-MPI1 benchmark with 4 MPI ranks
bound to the 4 sockets, hoping to use both IB cards (and both ports):
mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self
--mca btl_openib_if_include mlx4_0,mlx4_1 ./IMB-MPI1 SendRecv
but OpenMPI refuses to use the mlx4_1 device:
[node1.local:28265] [rank=0] openib: skipping device mlx4_1; it is
too far away
[ the same for other ranks ]
This is confusing, since I have read that OpenMPI automatically uses the
closer HCA, so at least one rank should choose mlx4_1. I bind by socket;
here is the reported binding map:
[node1.local:28263] MCW rank 2 bound to socket 2[core 24[hwt 0]]:
[./././././././././././.][./././././././././././.][B/././././././././././.][./././././././././././.]
[node1.local:28263] MCW rank 3 bound to socket 3[core 36[hwt 0]]:
[./././././././././././.][./././././././././././.][./././././././././././.][B/././././././././././.]
[node1.local:28263] MCW rank 0 bound to socket 0[core 0[hwt 0]]:
[B/././././././././././.][./././././././././././.][./././././././././././.][./././././././././././.]
[node1.local:28263] MCW rank 1 bound to socket 1[core 12[hwt 0]]:
[./././././././././././.][B/././././././././././.][./././././././././././.][./././././././././././.]
To check what's going on, I modified btl_openib_component.c to print the
computed distances:

/* print the distance computed between this rank and each openib device */
opal_output_verbose(1, ompi_btl_base_framework.framework_output,
                    "[rank=%d] openib: device %d/%d distance %lf",
                    ORTE_PROC_MY_NAME->vpid,
                    (int)i, (int)num_devs,
                    (double)dev_sorted[i].distance);
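Since opal_output_verbose() at level 1 only prints when the BTL framework
verbosity is at least 1, I collected this extra output with the verbosity
bumped up, along the lines of:

mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self \
       --mca btl_openib_if_include mlx4_0,mlx4_1 \
       --mca btl_base_verbose 1 ./IMB-MPI1 SendRecv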
Here is what I get:
[node1.local:28265] [rank=0] openib: device 0/2 distance 0.000000
[node1.local:28266] [rank=1] openib: device 0/2 distance 0.000000
[node1.local:28267] [rank=2] openib: device 0/2 distance 0.000000
[node1.local:28268] [rank=3] openib: device 0/2 distance 0.000000
[node1.local:28265] [rank=0] openib: device 1/2 distance 2.100000
[node1.local:28266] [rank=1] openib: device 1/2 distance 1.000000
[node1.local:28267] [rank=2] openib: device 1/2 distance 2.100000
[node1.local:28268] [rank=3] openib: device 1/2 distance 2.100000
So the computed distance to mlx4_0 is 0 for all ranks. I believe this
should not be so: the distance should be small for one rank and larger
for the other three, as it is for mlx4_1. Looks like a bug?
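Just to spell out the pattern I would expect (numbers invented for
illustration, mirroring what is reported for mlx4_1):

  mlx4_0: one rank (on the card's socket) ~1.0, the other three ~2.1
  mlx4_1: rank 1 gets 1.0, ranks 0/2/3 get 2.1 (this is what is reported)

instead, mlx4_0 comes out as 0.0 for every rank.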
Another question: in my configuration two ranks will have a 'closer' IB
card, but the other two will not. Since their (correctly computed)
distances to the two devices will likely be equal, which device will
those ranks pick if the selection is automatic? I'd rather they didn't
both pick mlx4_0.. I guess it would be nice if I could specify by hand
which device/port a given MPI rank should use. Is this (going to be)
possible with OpenMPI?
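To make that last wish concrete, here is roughly the kind of thing I have
in mind, as an untested sketch: a small wrapper (I'll call it pick_hca.sh;
the name and the rank-to-device split are just placeholders) that selects
the HCA per rank through the OMPI_MCA_btl_openib_if_include environment
variable, keyed on the local rank that mpirun exports:

#!/bin/sh
# pick_hca.sh -- hypothetical per-rank HCA selection wrapper (untested sketch)
# mpirun exports OMPI_COMM_WORLD_LOCAL_RANK to each process; use it to decide
# which openib device this rank is allowed to use.
case "$OMPI_COMM_WORLD_LOCAL_RANK" in
    0|1) export OMPI_MCA_btl_openib_if_include=mlx4_0 ;;
    *)   export OMPI_MCA_btl_openib_if_include=mlx4_1 ;;
esac
exec "$@"

launched with the if_include dropped from the mpirun line, so the per-rank
environment setting is not overridden by the command line:

mpirun --map-by socket --bind-to core -np 4 --mca btl openib,self \
       ./pick_hca.sh ./IMB-MPI1 SendRecv

Having to do this by hand is of course what I'd like to avoid, hence the
question.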
Thanks a lot,
Marcin