Have you tried recompiling OpenMPI with the --with-slurm option? That did the 
trick for me.
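
Something along these lines worked here -- a rough sketch, with placeholder paths (adjust --prefix and the source directory for your site; the --with-pmi flag is an extra that may also be needed if you launch with srun, not something the original advice requires):

```shell
# Rebuild OpenMPI 1.10.2 with Slurm support.
# /opt/openmpi-1.10.2 is a placeholder install prefix.
cd openmpi-1.10.2
./configure --prefix=/opt/openmpi-1.10.2 --with-slurm --with-pmi
make -j8 && make install

# Afterwards, check that the build actually picked up the Slurm components:
ompi_info | grep -i slurm
```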


> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
> 
> Hi,
> 
> I have a strange situation that I could use assistance with.  We recently 
> rebooted some nodes in our Slurm cluster and after the reboot, running MPI 
> programs on these nodes results in complaints from OpenMPI about the 
> Infiniband ports:
> 
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>  Local host:           XXXXXXXXXX
>  Local device:         mlx4_0
>  Local port:           1
>  CPCs attempted:       udcm
> --------------------------------------------------------------------------
> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
> 
> These nodes did receive some updates, but are otherwise all running the same 
> version of Slurm (15.08.7) and OpenMPI (1.10.2).  The weird thing is that if 
> I ssh into the affected nodes and use mpirun directly Infiniband works 
> correctly.  So the problem definitely involves an interaction between Slurm 
> (maybe via PMI?) and OpenMPI.
> 
> Any thoughts?
> 
> Thanks!,
> -Craig
> 