Although I see this with and without Slurm, there may very well be something wrong with my OpenMPI build.
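As a sanity check on the build itself, something like the following should show whether Slurm/PMI support and the openib BTL actually made it into the install (just a sketch; exact output varies by OpenMPI version and install prefix):

    # was Slurm / PMI support compiled in?
    ompi_info | grep -i slurm
    ompi_info | grep -i pmi

    # is the openib BTL component present?
    ompi_info | grep -i "MCA btl"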
On Thu, Aug 25, 2016 at 2:04 PM, Michael Di Domenico <[email protected]> wrote:
>
> I'm seeing this presently on our new cluster. I'm not sure what's
> going on. Did this ever get resolved?
>
> I can confirm that we have compiled OpenMPI with the Slurm options.
> We have other clusters which work fine, albeit this is our first
> Mellanox-based IB cluster, so I'm not sure if that has anything to do
> with it. I am using the same OpenMPI install between clusters.
>
>> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
>>
>> Hi,
>>
>> I have a strange situation that I could use assistance with. We
>> recently rebooted some nodes in our Slurm cluster and after the reboot,
>> running MPI programs on these nodes results in complaints from OpenMPI
>> about the Infiniband ports:
>>
>> --------------------------------------------------------------------------
>> No OpenFabrics connection schemes reported that they were able to be
>> used on a specific port. As such, the openib BTL (OpenFabrics
>> support) will be disabled for this port.
>>
>>   Local host:      XXXXXXXXXX
>>   Local device:    mlx4_0
>>   Local port:      1
>>   CPCs attempted:  udcm
>> --------------------------------------------------------------------------
>>
>> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>>
>> These nodes did receive some updates, but are otherwise all running the
>> same version of Slurm (15.08.7) and OpenMPI (1.10.2). The weird thing is
>> that if I ssh into the affected nodes and use mpirun directly, Infiniband
>> works correctly. So the problem definitely involves an interaction between
>> Slurm (maybe via PMI?) and OpenMPI.
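Since mpirun over ssh works but launching through Slurm does not, comparing the two launch paths with verbose BTL output may help narrow it down. A rough sketch (./a.out stands in for any MPI test binary, and the pmi2 plugin name assumes your Slurm build provides it):

    # which PMI plugins does this Slurm build offer?
    srun --mpi=list

    # ask the openib BTL to explain its CPC selection under Slurm
    export OMPI_MCA_btl_base_verbose=100
    srun --mpi=pmi2 -N 2 ./a.out

    # and check the IB port state directly on an affected node
    ibv_devinfo -d mlx4_0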
