Hi,

I have a strange situation that I could use assistance with.  We recently 
rebooted some nodes in our Slurm cluster and after the reboot, running MPI 
programs on these nodes results in complaints from OpenMPI about the Infiniband 
ports:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           XXXXXXXXXX
  Local device:         mlx4_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] 
[btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]

These nodes did receive some updates, but are otherwise running the same 
versions of Slurm (15.08.7) and OpenMPI (1.10.2) as the rest of the cluster.  
The weird thing is that if I ssh into the affected nodes and use mpirun 
directly, Infiniband works correctly.  So the problem definitely involves an 
interaction between Slurm (maybe via PMI?) and OpenMPI.
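For reference, the failing vs. working invocations look roughly like this 
(the program and node names are placeholders; the exact srun MPI flag on our 
cluster may differ):

```shell
# Fails after reboot: launched through Slurm, which sets up Open MPI
# via PMI -- the openib BTL gets disabled with the error above.
srun --mpi=pmi2 -N 2 -n 2 ./mpi_hello

# Works: ssh to one of the affected nodes and launch with mpirun
# directly, bypassing Slurm's PMI plumbing -- Infiniband comes up fine.
ssh node01
mpirun -np 2 -H node01,node02 ./mpi_hello
```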

Any thoughts?

Thanks!
-Craig