Maybe a dumb question, but one possibility is that the requesting client
(the rebooted node) is failing to complete its multicast group (MGID)
join, so have you verified with ibping and checked the SM log (wherever
your fabric master sits) for refused or partial IB joins?
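
In case it helps, here is roughly what I'd check; the opensm log path
below is a guess for a stock install, so substitute wherever your SM
actually logs, and swap <peer_port_guid> for a real port GUID from a
known-good node:

  # on the rebooted node: is the HCA port even ACTIVE?
  ibstat mlx4_0 1

  # ask the SA whether the node's port GID made it into the multicast groups
  saquery -g          # list multicast groups (MGIDs)
  saquery -m          # list multicast members; look for this port's GID

  # basic reachability over the fabric (run "ibping -S" on a good node first)
  ibping -G <peer_port_guid>

  # on the SM host: look for refused/failed joins around the reboot time
  grep -i -E "join|mcmr|err" /var/log/opensm.log | tail -n 50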

On Thu, Aug 25, 2016 at 1:04 PM, Michael Di Domenico <[email protected]> wrote:

>
> I'm seeing this presently on our new cluster.  I'm not sure what's
> going on.  Did this ever get resolved?
>
> I can confirm that we have compiled OpenMPI with the Slurm options.
> We have other clusters which work fine, albeit this is our first
> Mellanox-based IB cluster, so I'm not sure if that has anything to do
> with it.  I am using the same OpenMPI install between clusters.
>
>
>
> On Wed, Apr 13, 2016 at 2:51 AM, Lachlan Musicman <[email protected]>
> wrote:
> > I was reading about this today. Isn't OpenMPI compiled --with-slurm by
> > default when installing with one of the pkg managers?
> >
> > https://www.open-mpi.org/faq/?category=building#build-rte
> >
> > Cheers
> > L.
> >
> > ------
> > The most dangerous phrase in the language is, "We've always done it this
> > way."
> >
> > - Grace Hopper
> >
> > On 13 April 2016 at 16:30, Craig Yoshioka <[email protected]> wrote:
> >>
> >>
> >> Thanks, I'll add that to my list of things to try. I did use --with-pmi
> >> but not --with-slurm.
> >>
> >> Sent from my iPhone
> >>
> >> > On Apr 12, 2016, at 11:19 PM, Jordan Willis <[email protected]>
> >> > wrote:
> >> >
> >> >
> >> > Have you tried recompiling OpenMPI with the --with-slurm option?
> >> > That did the trick for me.
> >> >
> >> >
> >> >> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I have a strange situation that I could use assistance with.  We
> >> >> recently rebooted some nodes in our Slurm cluster and after the
> >> >> reboot, running MPI programs on these nodes results in complaints
> >> >> from OpenMPI about the Infiniband ports:
> >> >>
> >> >>
> >> >> --------------------------------------------------------------------------
> >> >> No OpenFabrics connection schemes reported that they were able to be
> >> >> used on a specific port.  As such, the openib BTL (OpenFabrics
> >> >> support) will be disabled for this port.
> >> >>
> >> >> Local host:           XXXXXXXXXX
> >> >> Local device:         mlx4_0
> >> >> Local port:           1
> >> >> CPCs attempted:       udcm
> >> >> --------------------------------------------------------------------------
> >> >>
> >> >> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
> >> >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
> >> >>
> >> >> These nodes did receive some updates, but are otherwise all running
> >> >> the same version of Slurm (15.08.7) and OpenMPI (1.10.2).  The
> >> >> weird thing is that if I ssh into the affected nodes and use mpirun
> >> >> directly, Infiniband works correctly.  So the problem definitely
> >> >> involves an interaction between Slurm (maybe via PMI?) and OpenMPI.
> >> >>
> >> >> Any thoughts?
> >> >>
> >> >> Thanks!,
> >> >> -Craig
> >> >>
> >
> >
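
Also, since mpirun over ssh works but launching through Slurm doesn't,
it may be worth double-checking that the OpenMPI install srun picks up
really does have the Slurm/PMI components built in, and that srun is
actually offering a PMI plugin.  Roughly (exact component names vary a
bit between versions, and ./your_mpi_prog is just a placeholder):

  ompi_info | grep -i -E "slurm|pmi"    # should list slurm/pmi MCA components
  srun --mpi=list                       # PMI plugins this Slurm build provides
  srun --mpi=pmi2 -N2 ./your_mpi_prog   # explicitly request PMI2 as a test

If the slurm/pmi components turn out to be missing, rebuilding with
--with-slurm and --with-pmi pointed at your Slurm install, as suggested
earlier in the thread, would be the next thing to try.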
