Thanks, I'll add that to my list of things to try. I did use --with-pmi but not --with-slurm.
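For reference, a rough sketch of the configure invocation I plan to try next (the install prefix and Slurm path below are placeholders, not our actual locations):

  # Rebuild OpenMPI 1.10.2 with both Slurm and PMI support enabled.
  ./configure --prefix=/opt/openmpi-1.10.2 \
              --with-slurm \
              --with-pmi=/usr/local/slurm
  make -j8 && make install

  # Sanity check that the Slurm components were actually built in:
  ompi_info | grep -i slurm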
Sent from my iPhone

> On Apr 12, 2016, at 11:19 PM, Jordan Willis <[email protected]> wrote:
>
> Have you tried recompiling openmpi with the --with-slurm option? That did the
> trick for me
>
>> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
>>
>> Hi,
>>
>> I have a strange situation that I could use assistance with. We recently
>> rebooted some nodes in our Slurm cluster and after the reboot, running MPI
>> programs on these nodes results in complaints from OpenMPI about the
>> Infiniband ports:
>>
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port. As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>>   Local host:      XXXXXXXXXX
>>   Local device:    mlx4_0
>>   Local port:      1
>>   CPCs attempted:  udcm
>> --------------------------------------------------------------------------
>> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>>
>> These nodes did receive some updates, but are otherwise all running the same
>> version of Slurm (15.08.7) and OpenMPI (1.10.2). The weird thing is that if
>> I ssh into the affected nodes and use mpirun directly, Infiniband works
>> correctly. So the problem definitely involves an interaction between Slurm
>> (maybe via PMI?) and OpenMPI.
>>
>> Any thoughts?
>>
>> Thanks!
>> -Craig
