Thanks, I'll add that to my list of things to try. I did use --with-pmi but not 
--with-slurm. 
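
For reference, the rebuild I'm planning to try looks roughly like this (the
prefix and PMI paths below are just placeholders for our local install):

   ./configure --prefix=/opt/openmpi-1.10.2 \
               --with-slurm \
               --with-pmi=/usr    # wherever Slurm's PMI headers/libs live
   make -j && make install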


> On Apr 12, 2016, at 11:19 PM, Jordan Willis <[email protected]> wrote:
> 
> 
> Have you tried recompiling OpenMPI with the --with-slurm option? That did the 
> trick for me.
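> 
> Once it's rebuilt, you can sanity-check that Slurm support actually got
> compiled in with something like:
> 
>    ompi_info | grep -i slurm
> 
> If the build picked it up, you should see slurm entries for components
> such as ess, plm, and ras.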
> 
> 
>> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
>> 
>> Hi,
>> 
>> I have a strange situation that I could use assistance with.  We recently 
>> rebooted some nodes in our Slurm cluster and after the reboot, running MPI 
>> programs on these nodes results in complaints from OpenMPI about the 
>> Infiniband ports:
>> 
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port.  As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>> 
>> Local host:           XXXXXXXXXX
>> Local device:         mlx4_0
>> Local port:           1
>> CPCs attempted:       udcm
>> --------------------------------------------------------------------------
>> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>> 
>> These nodes did receive some updates, but are otherwise all running the same 
>> versions of Slurm (15.08.7) and OpenMPI (1.10.2). The weird thing is that if 
>> I ssh into the affected nodes and use mpirun directly, Infiniband works 
>> correctly. So the problem definitely involves an interaction between Slurm 
>> (maybe via PMI?) and OpenMPI.
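>> 
>> For what it's worth, the failing case is launching through Slurm, roughly
>> like this (the binary and node names are just placeholders):
>> 
>>    srun --mpi=pmi2 -n 2 ./mpi_hello           # fails with the openib warning above
>> 
>> while sshing in and launching by hand works:
>> 
>>    mpirun -np 2 -H node01,node02 ./mpi_hello  # Infiniband comes up fine
>> 
>> (srun --mpi=list shows which PMI plugins this Slurm build supports.)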
>> 
>> Any thoughts?
>> 
>> Thanks!
>> -Craig
>> 
