I was reading about this today. Isn't OpenMPI compiled --with-slurm by default when installing with one of the package managers?
https://www.open-mpi.org/faq/?category=building#build-rte

Cheers,
L.

------
The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper

On 13 April 2016 at 16:30, Craig Yoshioka <[email protected]> wrote:
>
> Thanks, I'll add that to my list of things to try. I did use --with-pmi
> but not --with-slurm.
>
> Sent from my iPhone
>
>> On Apr 12, 2016, at 11:19 PM, Jordan Willis <[email protected]> wrote:
>>
>> Have you tried recompiling openmpi with the --with-slurm option? That did
>> the trick for me
>>
>>> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I have a strange situation that I could use assistance with. We
>>> recently rebooted some nodes in our Slurm cluster and after the reboot,
>>> running MPI programs on these nodes results in complaints from OpenMPI
>>> about the Infiniband ports:
>>>
>>> --------------------------------------------------------------------------
>>> No OpenFabrics connection schemes reported that they were able to be
>>> used on a specific port. As such, the openib BTL (OpenFabrics
>>> support) will be disabled for this port.
>>>
>>>   Local host:      XXXXXXXXXX
>>>   Local device:    mlx4_0
>>>   Local port:      1
>>>   CPCs attempted:  udcm
>>> --------------------------------------------------------------------------
>>> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>>>
>>> These nodes did receive some updates, but are otherwise all running the
>>> same version of Slurm (15.08.7) and OpenMPI (1.10.2). The weird thing is
>>> that if I ssh into the affected nodes and use mpirun directly Infiniband
>>> works correctly. So the problem definitely involves an interaction between
>>> Slurm (maybe via PMI?) and OpenMPI.
>>>
>>> Any thoughts?
>>>
>>> Thanks!,
>>> -Craig
>>>
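For anyone following along, the check-then-rebuild workflow discussed above can be sketched roughly as follows. This is a sketch, not a definitive recipe: the install prefix and the PMI path are assumptions (Slurm sites vary in where `pmi.h` and `libpmi` live), so adjust them to your installation.

```shell
# 1. Check whether the existing Open MPI build already has Slurm/PMI
#    support compiled in (ompi_info lists the built MCA components):
ompi_info | grep -i slurm
ompi_info | grep -i pmi

# 2. If not, reconfigure and rebuild with explicit Slurm and PMI support.
#    The prefix and the PMI path below are assumptions -- point --with-pmi
#    at the directory containing include/pmi.h and the libpmi library:
./configure --prefix=/opt/openmpi-1.10.2 \
            --with-slurm \
            --with-pmi=/usr
make -j"$(nproc)" && make install

# 3. Under Slurm, launch through srun with a PMI plugin rather than
#    calling mpirun directly, so the job step wires up via PMI:
srun --mpi=pmi2 -n 4 ./my_mpi_program
```

Step 1 is worth running on both a working and a broken node: if `ompi_info` output differs, the nodes are not actually running the same build, which would explain why `mpirun` over ssh works while Slurm-launched jobs fail.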
