Ludovic,

What happens here is that, by default, an MPI task will only use the
closest IB device. Since your tasks are bound to a socket, tasks on
socket 0 will only use mlx4_0, and tasks on socket 1 will only use
mlx4_1. Because these two devices are on independent subnets, that also
means tasks on socket 0 cannot communicate with tasks on socket 1 via
the openib BTL.
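
If you want to double check that locality, hwloc's lstopo will show each
mlx4 device under the socket it is attached to (assuming your hwloc was
built with PCI support); for example:

# print the machine topology to the terminal, including PCI devices such
# as mlx4_0 / mlx4_1 under their respective socket / NUMA node
lstopo --of console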

So you have to explicitly direct Open MPI to use all the IB interfaces:

mpirun --mca btl_openib_ignore_locality 1 ...
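
Combined with the flags from your case 3 command line, that would look
something like this (untested on my side, so please double check the
parameter is available in your Open MPI 2.0.0 build):

mpirun -rf rankfile \
    --mca btl self,openib \
    --mca btl_openib_if_include mlx4_0,mlx4_1 \
    --mca btl_openib_ignore_locality 1 \
    a.out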

I do not think that will perform optimally, though :-(
For this type of setup, I'd rather suggest putting all the IB ports on
the same subnet.
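
For reference, a quick way to see which subnet prefix each port currently
has (assuming the standard OFED diagnostic tools are installed) is to look
at GID[0] of each device; its upper 64 bits are the subnet prefix assigned
by the subnet manager:

# the first four groups of GID[0] are the subnet prefix for that port
ibv_devinfo -v -d mlx4_0 | grep 'GID\['
ibv_devinfo -v -d mlx4_1 | grep 'GID\['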


Cheers,

Gilles

On Wed, Jul 19, 2017 at 9:20 PM, Ludovic Raess <ludovic.ra...@unil.ch> wrote:
> Hi,
>
> We have an issue on our 32 nodes Linux cluster regarding the usage of Open
> MPI in a Infiniband dual-rail configuration.
>
> Node config:
> - Supermicro dual-socket board with Xeon E5 v3 6-core CPUs
> - 4 Titan X GPUs
> - 2 IB ConnectX FDR single-port HCAs (mlx4_0 and mlx4_1)
> - CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7
>
> IB dual-rail configuration: two independent 36-port IB switches; each of
> the two single-port IB HCAs is connected to its own IB subnet.
>
> The nodes are additionally connected via Ethernet for admin.
>
> ------------------------------------------------------------
>
> Consider the node topology below as valid for each of the 32 nodes of the
> cluster:
>
> At the PCIe root complex level, each CPU manages two GPUs and a single IB
> card:
> CPU0     |    CPU1
> mlx4_0   |    mlx4_1
> GPU0     |    GPU2
> GPU1     |    GPU3
>
> MPI ranks are bound to a socket via a rankfile and are distributed over
> the two sockets of each node:
> rank 0=node01 slot=0:2
> rank 1=node01 slot=1:2
> rank 2=node02 slot=0:2
> ...
> rank n=nodeNN slot=0,1:2
>
>
> case 1: with a single IB HCA used (either one of the two), all ranks can
>         communicate with each other via openib only, independently of
>         their relative socket binding. The tcp BTL can be explicitly
>         disabled since there is no tcp traffic.
>
>         "mpirun -rf rankfile --mca btl_openib_if_include mlx4_0 --mca btl
> self,openib a.out"
>
> case 2: in some rare cases, the topology of our MPI job is such that
>         processes on socket 0 communicate only with other processes on
>         socket 0, and the same holds for processes on socket 1. In this
>         context, the two IB rails are effectively used in parallel and
>         all ranks communicate as needed via openib only, with no tcp
>         traffic.
>
>         "mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca
> btl self,openib a.out"
>
> case 3: most of the time we have "cross socket" communications between
>         ranks on different nodes. In this context, Open MPI falls back to
>         the tcp BTL when communications involve socket 0 on one node and
>         socket 1 on another, and this slows down our jobs.
>
> mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 a.out
> [node01.octopoda:16129] MCW rank 0 bound to socket 0[core 2[hwt 0]]:
> [././B/././.][./././././.]
> [node02.octopoda:12061] MCW rank 1 bound to socket 1[core 10[hwt 0]]:
> [./././././.][././././B/.]
> [node02.octopoda:12062] [rank=1] openib: skipping device mlx4_0; it is too
> far away
> [node01.octopoda:16130] [rank=0] openib: skipping device mlx4_1; it is too
> far away
> [node02.octopoda:12062] [rank=1] openib: using port mlx4_1:1
> [node01.octopoda:16130] [rank=0] openib: using port mlx4_0:1
> [node02.octopoda:12062] mca: bml: Using self btl to [[11337,1],1] on node
> node02
> [node01.octopoda:16130] mca: bml: Using self btl to [[11337,1],0] on node
> node01
> [node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node
> node01
> [node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node
> node01
> [node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node
> node01
> [node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node
> node02
> [node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node
> node02
> [node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node
> node02
>
>
>         Trying to force the use of the two IB HCAs and to disable the tcp
>         BTL results in the following error:
>
> mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl self,openib a.out
> [node02.octopoda:11818] MCW rank 1 bound to socket 1[core 10[hwt 0]]:
> [./././././.][././././B/.]
> [node01.octopoda:15886] MCW rank 0 bound to socket 0[core 2[hwt 0]]:
> [././B/././.][./././././.]
> [node01.octopoda:15887] [rank=0] openib: skipping device mlx4_1; it is too
> far away
> [node02.octopoda:11819] [rank=1] openib: skipping device mlx4_0; it is too
> far away
> [node01.octopoda:15887] [rank=0] openib: using port mlx4_0:1
> [node02.octopoda:11819] [rank=1] openib: using port mlx4_1:1
> [node02.octopoda:11819] mca: bml: Using self btl to [[25017,1],1] on node
> node02
> [node01.octopoda:15887] mca: bml: Using self btl to [[25017,1],0] on node
> node01
> -------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[25017,1],1]) is on host: node02
>   Process 2 ([[25017,1],0]) is on host: node01
>   BTLs attempted: self openib
>
> Your MPI job is now going to abort; sorry.
> -------------------------------------
>
>