I'm seeing this presently on our new cluster.  I'm not sure what's
going on.  Did this ever get resolved?

I can confirm that we compiled OpenMPI with the Slurm options.  We
have other clusters that work fine, although this is our first
Mellanox-based IB cluster, so I'm not sure whether that has anything
to do with it.  I am using the same OpenMPI install across clusters.
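For reference, this is roughly how the install was configured and how
I sanity-check it on a node.  The prefix and PMI path below are
placeholders, not our exact values:

    # configure flags (paths here are placeholders for our actual ones)
    ./configure --prefix=/opt/openmpi-1.10.2 \
                --with-slurm \
                --with-pmi=/usr \
                --with-verbs

    # confirm the Slurm and PMI components made it into this build
    ompi_info | grep -i slurm
    ompi_info --parsable --all | grep -i pmi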



On Wed, Apr 13, 2016 at 2:51 AM, Lachlan Musicman <[email protected]> wrote:
> I was reading about this today. Isn't OpenMPI compiled --with-slurm by
> default when installing with one of the pkg managers?
>
> https://www.open-mpi.org/faq/?category=building#build-rte
>
> Cheers
> L.
>
> ------
> The most dangerous phrase in the language is, "We've always done it this
> way."
>
> - Grace Hopper
>
> On 13 April 2016 at 16:30, Craig Yoshioka <[email protected]> wrote:
>>
>>
>> Thanks, I'll add that to my list of things to try. I did use --with-pmi
>> but not --with-slurm.
>>
>> Sent from my iPhone
>>
>> > On Apr 12, 2016, at 11:19 PM, Jordan Willis <[email protected]>
>> > wrote:
>> >
>> >
>> > Have you tried recompiling openmpi with the --with-slurm option? That did
>> > the trick for me
>> >
>> >
>> >> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
>> >>
>> >> Hi,
>> >>
>> >> I have a strange situation that I could use assistance with.  We
>> >> recently rebooted some nodes in our Slurm cluster and after the reboot,
>> >> running MPI programs on these nodes results in complaints from OpenMPI 
>> >> about
>> >> the Infiniband ports:
>> >>
>> >>
>> >> --------------------------------------------------------------------------
>> >> No OpenFabrics connection schemes reported that they were able to be
>> >> used on a specific port.  As such, the openib BTL (OpenFabrics
>> >> support) will be disabled for this port.
>> >>
>> >> Local host:           XXXXXXXXXX
>> >> Local device:         mlx4_0
>> >> Local port:           1
>> >> CPCs attempted:       udcm
>> >> --------------------------------------------------------------------------
>> >>
>> >> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>> >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>> >>
>> >> These nodes did receive some updates, but are otherwise all running the
>> >> same version of Slurm (15.08.7) and OpenMPI (1.10.2).  The weird thing is
>> >> that if I ssh into the affected nodes and use mpirun directly, Infiniband
>> >> works correctly.  So the problem definitely involves an interaction 
>> >> between
>> >> Slurm (maybe via PMI?) and OpenMPI.
>> >>
>> >> Any thoughts?
>> >>
>> >> Thanks!
>> >> -Craig
>> >>
>
>
