Maybe a dumb question, but one possibility is that the requesting client (the rebooted node) is failing to complete its MGID join(), so have you verified with ibping and checked the SM log (wherever your fabric master sits) for IB partial-join refusals?
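If it helps, this is roughly what I'd run from one of the rebooted nodes. The GUID and the opensm log path below are placeholders for a typical setup, so substitute your own:

```shell
# Check that the HCA port came back up after the reboot
ibstat mlx4_0 1          # look for State: Active, Physical state: LinkUp

# Basic reachability test; run "ibping -S" on a known-good node first
# to start the server side, then ping it by port GUID from here
ibping -G 0x0002c90300a1b2c3   # placeholder GUID -- use a real one

# Dump multicast group records from the SM's point of view to confirm
# the node's membership actually registered
saquery -g

# Then grep the subnet manager's log on whichever node runs opensm
# (path is distro-dependent; /var/log/opensm.log is common)
grep -i "join\|ERR" /var/log/opensm.log | tail -20
```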
On Thu, Aug 25, 2016 at 1:04 PM, Michael Di Domenico <[email protected]> wrote:
>
> I'm seeing this presently on our new cluster.  I'm not sure what's
> going on.  Did this ever get resolved?
>
> I can confirm that we have compiled openmpi with the slurm options.
> We have other clusters which work fine, albeit this is our first
> Mellanox-based IB cluster, so I'm not sure if that has anything to do
> with it.  I am using the same openmpi install between clusters.
>
> On Wed, Apr 13, 2016 at 2:51 AM, Lachlan Musicman <[email protected]> wrote:
> > I was reading about this today. Isn't OpenMPI compiled --with-slurm by
> > default when installing with one of the pkg managers?
> >
> > https://www.open-mpi.org/faq/?category=building#build-rte
> >
> > Cheers
> > L.
> >
> > ------
> > The most dangerous phrase in the language is, "We've always done it this
> > way."
> >
> > - Grace Hopper
> >
> > On 13 April 2016 at 16:30, Craig Yoshioka <[email protected]> wrote:
> >>
> >> Thanks, I'll add that to my list of things to try. I did use --with-pmi
> >> but not --with-slurm.
> >>
> >> Sent from my iPhone
> >>
> >> > On Apr 12, 2016, at 11:19 PM, Jordan Willis <[email protected]> wrote:
> >> >
> >> > Have you tried recompiling openmpi with the --with-slurm option? That
> >> > did the trick for me
> >> >
> >> >> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I have a strange situation that I could use assistance with. We
> >> >> recently rebooted some nodes in our Slurm cluster and after the reboot,
> >> >> running MPI programs on these nodes results in complaints from OpenMPI
> >> >> about the Infiniband ports:
> >> >>
> >> >> --------------------------------------------------------------------------
> >> >> No OpenFabrics connection schemes reported that they were able to be
> >> >> used on a specific port.  As such, the openib BTL (OpenFabrics
> >> >> support) will be disabled for this port.
> >> >>
> >> >>   Local host:       XXXXXXXXXX
> >> >>   Local device:     mlx4_0
> >> >>   Local port:       1
> >> >>   CPCs attempted:   udcm
> >> >> --------------------------------------------------------------------------
> >> >>
> >> >> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
> >> >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
> >> >>
> >> >> These nodes did receive some updates, but are otherwise all running the
> >> >> same version of Slurm (15.08.7) and OpenMPI (1.10.2). The weird thing is
> >> >> that if I ssh into the affected nodes and use mpirun directly Infiniband
> >> >> works correctly. So the problem definitely involves an interaction between
> >> >> Slurm (maybe via PMI?) and OpenMPI.
> >> >>
> >> >> Any thoughts?
> >> >>
> >> >> Thanks!,
> >> >> -Craig
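For what it's worth, if recompiling is what fixes it, the configure line the thread converges on would look something like the sketch below. The install prefix and the PMI path are placeholders; --with-pmi should point at whatever directory holds your Slurm install's include/pmi.h:

```shell
# Rebuild Open MPI 1.10.2 with both Slurm and PMI support
# (prefix and PMI path below are placeholders for your site)
./configure --prefix=/opt/openmpi-1.10.2 \
            --with-slurm \
            --with-pmi=/usr \
    && make -j"$(nproc)" && make install

# Sanity check: ompi_info should now list slurm components
ompi_info | grep -i slurm
```

Note that --with-pmi is what lets srun launch the ranks directly; --with-slurm alone only covers the resource-manager integration for mpirun inside an allocation.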
