Have you tried recompiling OpenMPI with the --with-slurm option? That did the trick for me.
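For reference, checking the existing build and rebuilding might look roughly like this; the install prefix and the extra --with-pmi flag are assumptions on my part, so adjust for your setup:

```shell
# Check whether the installed OpenMPI was built with Slurm/PMI support;
# slurm/pmi components should appear in the output if it was:
ompi_info | grep -i -E 'slurm|pmi'

# If they are missing, reconfigure and rebuild from the source tree
# (prefix is an example, not your actual install path):
./configure --prefix=/opt/openmpi-1.10.2 --with-slurm --with-pmi
make -j4 && make install
```

With 1.10.x, building against Slurm's PMI library is what lets srun-launched ranks exchange their openib endpoint info, which is consistent with mpirun over ssh working while srun fails.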
> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
>
> Hi,
>
> I have a strange situation that I could use assistance with. We recently
> rebooted some nodes in our Slurm cluster and after the reboot, running MPI
> programs on these nodes results in complaints from OpenMPI about the
> Infiniband ports:
>
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port. As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
> Local host: XXXXXXXXXX
> Local device: mlx4_0
> Local port: 1
> CPCs attempted: udcm
> --------------------------------------------------------------------------
> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>
> These nodes did receive some updates, but are otherwise all running the same
> version of Slurm (15.08.7) and OpenMPI (1.10.2). The weird thing is that if
> I ssh into the affected nodes and use mpirun directly, Infiniband works
> correctly. So the problem definitely involves an interaction between Slurm
> (maybe via PMI?) and OpenMPI.
>
> Any thoughts?
>
> Thanks!,
> -Craig
>
