Ludovic,

What happens here is that, by default, an MPI task will only use the closest IB device. Since your tasks are bound to a socket, tasks on socket 0 will only use mlx4_0, and tasks on socket 1 will only use mlx4_1. And because these two HCAs are on independent subnets, tasks on socket 0 cannot communicate with tasks on socket 1 via the openib btl.
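A quick way to double-check that this locality is what Open MPI actually sees is to dump the hwloc topology on one node (just a sanity check, assuming the hwloc command-line tools are installed):

lstopo-no-graphics

Each mlx4_X device should show up in the tree under the package it hangs off, and that is the same topology information the openib btl's "too far away" decision is based on.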
So you have to explicitly direct Open MPI to use all the IB interfaces:

mpirun --mca btl_openib_ignore_locality 1 ...

I do not think that will perform optimally, though :-(
For this type of setup, I would rather suggest putting all the IB ports on the same subnet.

Cheers,

Gilles

On Wed, Jul 19, 2017 at 9:20 PM, Ludovic Raess <ludovic.ra...@unil.ch> wrote:
> Hi,
>
> We have an issue on our 32-node Linux cluster regarding the usage of Open MPI
> in an Infiniband dual-rail configuration.
>
> Node config:
> - Supermicro dual-socket Xeon E5 v3 6-core CPUs
> - 4 Titan X GPUs
> - 2 IB ConnectX FDR single-port HCAs (mlx4_0 and mlx4_1)
> - CentOS 6.6, OFED 3.1, Open MPI 2.0.0, gcc 5.4, CUDA 7
>
> IB dual-rail configuration: two independent IB switches (36 ports); each of
> the two single-port IB HCAs is connected to its own IB subnet.
>
> The nodes are additionally connected via Ethernet for admin.
>
> ------------------------------------------------------------
>
> Consider the node topology below as valid for every one of the 32 nodes in
> the cluster.
>
> At the PCIe root complex level, each CPU manages two GPUs and a single IB card:
>
>   CPU0   | CPU1
>   mlx4_0 | mlx4_1
>   GPU0   | GPU2
>   GPU1   | GPU3
>
> MPI ranks are bound to a socket via a rankfile and are distributed over the
> 2 sockets of each node:
>
>   rank 0=node01 slot=0:2
>   rank 1=node01 slot=1:2
>   rank 2=node02 slot=0:2
>   ...
>   rank n=nodeNN slot=0,1:2
>
> case 1: with a single IB HCA in use (either one of the two), all ranks can
> communicate with each other via openib only, independently of their relative
> socket binding. The tcp btl can be explicitly disabled as there is no tcp
> traffic.
>
> "mpirun -rf rankfile --mca btl_openib_if_include mlx4_0 --mca btl self,openib a.out"
>
> case 2: in some rare cases, the topology of our MPI job is such that processes
> on socket 0 communicate only with other processes on socket 0, and the same is
> true for processes on socket 1. In this context the two IB rails are
> effectively used in parallel and all ranks communicate as needed via openib
> only, with no tcp traffic.
>
> "mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl self,openib a.out"
>
> case 3: most of the time we have "cross-socket" communications between ranks
> on different nodes. In this context Open MPI reverts to using tcp when
> communications involve even and odd sockets, which slows down our jobs.
>
> mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 a.out
> [node01.octopoda:16129] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
> [node02.octopoda:12061] MCW rank 1 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
> [node02.octopoda:12062] [rank=1] openib: skipping device mlx4_0; it is too far away
> [node01.octopoda:16130] [rank=0] openib: skipping device mlx4_1; it is too far away
> [node02.octopoda:12062] [rank=1] openib: using port mlx4_1:1
> [node01.octopoda:16130] [rank=0] openib: using port mlx4_0:1
> [node02.octopoda:12062] mca: bml: Using self btl to [[11337,1],1] on node node02
> [node01.octopoda:16130] mca: bml: Using self btl to [[11337,1],0] on node node01
> [node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
> [node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
> [node02.octopoda:12062] mca: bml: Using tcp btl to [[11337,1],0] on node node01
> [node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
> [node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
> [node01.octopoda:16130] mca: bml: Using tcp btl to [[11337,1],1] on node node02
>
> Trying to force the use of the two IB HCAs and to disable the tcp btl results
> in the following error:
>
> mpirun -rf rankfile --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl self,openib a.out
> [node02.octopoda:11818] MCW rank 1 bound to socket 1[core 10[hwt 0]]: [./././././.][././././B/.]
> [node01.octopoda:15886] MCW rank 0 bound to socket 0[core 2[hwt 0]]: [././B/././.][./././././.]
> [node01.octopoda:15887] [rank=0] openib: skipping device mlx4_1; it is too far away
> [node02.octopoda:11819] [rank=1] openib: skipping device mlx4_0; it is too far away
> [node01.octopoda:15887] [rank=0] openib: using port mlx4_0:1
> [node02.octopoda:11819] [rank=1] openib: using port mlx4_1:1
> [node02.octopoda:11819] mca: bml: Using self btl to [[25017,1],1] on node node02
> [node01.octopoda:15887] mca: bml: Using self btl to [[25017,1],0] on node node01
> -------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[25017,1],1]) is on host: node02
> Process 2 ([[25017,1],0]) is on host: node01
> BTLs attempted: self openib
>
> Your MPI job is now going to abort; sorry.
> -------------------------------------
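For reference, putting the ignore-locality suggestion together with the case 3 command line gives something like the following (just a sketch; whether btl_openib_ignore_locality is available in the installed Open MPI 2.0.0 build can be checked with "ompi_info --param btl openib --level 9"):

mpirun -rf rankfile --mca btl_openib_ignore_locality 1 --mca btl_openib_if_include mlx4_0,mlx4_1 --mca btl self,openib a.out

This makes every rank consider both HCAs regardless of distance, so the tcp fallback is avoided, but as noted above it may not perform optimally; putting both rails on a single IB subnet remains the cleaner fix.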