Hi, I have a strange situation that I could use assistance with. We recently rebooted some nodes in our Slurm cluster and after the reboot, running MPI programs on these nodes results in complaints from OpenMPI about the Infiniband ports:
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:      XXXXXXXXXX
  Local device:    mlx4_0
  Local port:      1
  CPCs attempted:  udcm
--------------------------------------------------------------------------
[XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create] [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]

These nodes did receive some updates, but they are otherwise running the same versions of Slurm (15.08.7) and OpenMPI (1.10.2) as the rest of the cluster. The weird thing is that if I ssh into an affected node and use mpirun directly, Infiniband works correctly. So the problem definitely involves an interaction between Slurm (maybe via PMI?) and OpenMPI.

Any thoughts?

Thanks!
-Craig
