Although I see this with and without Slurm, there may very well be
something wrong with my OMPI compile.
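
For what it's worth, here is roughly how I've been sanity-checking the
build on my end; this is just a sketch, and the exact component names can
differ between OpenMPI versions:

  ompi_info | grep -i slurm     # should list the Slurm plm/ras components
  ompi_info | grep -i pmi       # shows whether PMI support was compiled in
  ompi_info | grep -i openib    # confirms the openib BTL is present

If any of those come back empty, the install probably wasn't configured
with the corresponding support.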

On Thu, Aug 25, 2016 at 2:04 PM, Michael Di Domenico
<[email protected]> wrote:
>
> I'm seeing this presently on our new cluster.  I'm not sure what's
> going on.  Did this ever get resolved?
>
> I can confirm that we have compiled OpenMPI with the Slurm options.
> We have other clusters which work fine, although this is our first
> Mellanox-based IB cluster, so I'm not sure if that has anything to do
> with it.  I am using the same OpenMPI install between clusters.
>
>>> >> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I have a strange situation that I could use assistance with.  We
>>> >> recently rebooted some nodes in our Slurm cluster and after the reboot,
>>> >> running MPI programs on these nodes results in complaints from OpenMPI
>>> >> about the Infiniband ports:
>>> >>
>>> >>
>>> >> --------------------------------------------------------------------------
>>> >> No OpenFabrics connection schemes reported that they were able to be
>>> >> used on a specific port.  As such, the openib BTL (OpenFabrics
>>> >> support) will be disabled for this port.
>>> >>
>>> >> Local host:           XXXXXXXXXX
>>> >> Local device:         mlx4_0
>>> >> Local port:           1
>>> >> CPCs attempted:       udcm
>>> >> --------------------------------------------------------------------------
>>> >>
>>> >> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>> >> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>>> >>
>>> >> These nodes did receive some updates, but are otherwise all running the
>>> >> same version of Slurm (15.08.7) and OpenMPI (1.10.2).  The weird thing is
>>> >> that if I ssh into the affected nodes and use mpirun directly, Infiniband
>>> >> works correctly.  So the problem definitely involves an interaction
>>> >> between Slurm (maybe via PMI?) and OpenMPI.
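
One thing that might be worth comparing, since mpirun over ssh reportedly
works while the Slurm-launched jobs fail, is the environment the processes
actually see in each case.  Something along these lines; the node name is
just a placeholder, and the exact PMI variables depend on how Slurm and
OpenMPI were built:

  # inside a Slurm allocation, on the affected node
  srun -w <node> -N1 -n1 bash -c 'ulimit -l; env | grep -i pmi; ibv_devinfo | head'

  # the same checks from a plain ssh login
  ssh <node> 'ulimit -l; env | grep -i pmi; ibv_devinfo | head'

If the locked-memory limit or the verbs device list looks different under
srun than under ssh, that would point at the launcher environment rather
than the IB stack itself.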
