Check your IB setup, Michael - you probably don’t have UD enabled on it
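If it helps as a first sanity check, here is a minimal sketch (mine, not anything from the OMPI tree) that uses plain libibverbs to print the link state of every HCA port; ibv_devinfo from libibverbs-utils reports the same information.  A port can show PORT_ACTIVE here and udcm can still refuse it, so this only rules out the obvious cases:

/* check_ib_ports.c - minimal sketch: report the link state of each HCA port.
 * Build (assuming the libibverbs development headers are installed):
 *   gcc check_ib_ports.c -o check_ib_ports -libverbs
 */
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (!devs || num_devices == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(devs[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr dev_attr;
        if (ibv_query_device(ctx, &dev_attr) == 0) {
            /* Port numbers are 1-based in the verbs API. */
            for (uint8_t p = 1; p <= dev_attr.phys_port_cnt; p++) {
                struct ibv_port_attr port_attr;
                if (ibv_query_port(ctx, p, &port_attr) == 0) {
                    printf("%s port %u: state=%s link_layer=%s\n",
                           ibv_get_device_name(devs[i]), p,
                           ibv_port_state_str(port_attr.state),
                           port_attr.link_layer == IBV_LINK_LAYER_ETHERNET
                               ? "Ethernet" : "InfiniBand");
                }
            }
        }
        ibv_close_device(ctx);
    }
    ibv_free_device_list(devs);
    return 0;
}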

> On Aug 25, 2016, at 11:42 AM, Michael Di Domenico <[email protected]> wrote:
> 
> 
> I see this with and without Slurm, though, so there may very well be
> something wrong with my OMPI compile.
> 
> On Thu, Aug 25, 2016 at 2:04 PM, Michael Di Domenico
> <[email protected]> wrote:
>> 
>> I'm seeing this presently on our new cluster.  I'm not sure what's
>> going on.  Did this ever get resolved?
>> 
>> I can confirm that we have compiled OpenMPI with the Slurm options.
>> We have other clusters which work fine, though this is our first
>> Mellanox-based IB cluster, so I'm not sure if that has anything to do
>> with it.  I am using the same OpenMPI install across clusters.
>> 
>>>>>> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I have a strange situation that I could use assistance with.  We
>>>>>> recently rebooted some nodes in our Slurm cluster and after the reboot,
>>>>>> running MPI programs on these nodes results in complaints from OpenMPI
>>>>>> about the InfiniBand ports:
>>>>>> 
>>>>>> 
>>>>>> --------------------------------------------------------------------------
>>>>>> No OpenFabrics connection schemes reported that they were able to be
>>>>>> used on a specific port.  As such, the openib BTL (OpenFabrics
>>>>>> support) will be disabled for this port.
>>>>>> 
>>>>>> Local host:           XXXXXXXXXX
>>>>>> Local device:         mlx4_0
>>>>>> Local port:           1
>>>>>> CPCs attempted:       udcm
>>>>>> --------------------------------------------------------------------------
>>>>>> 
>>>>>> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>>>>>> 
>>>>>> These nodes did receive some updates, but are otherwise all running the
>>>>>> same version of Slurm (15.08.7) and OpenMPI (1.10.2).  The weird thing is
>>>>>> that if I ssh into the affected nodes and use mpirun directly, InfiniBand
>>>>>> works correctly.  So the problem definitely involves an interaction
>>>>>> between Slurm (maybe via PMI?) and OpenMPI.
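
For anyone trying to reproduce this, a small test that forces inter-rank traffic (so the openib BTL actually has to set up connections, rather than just getting through MPI_Init) is handy for comparing a direct mpirun launch against a Slurm/PMI launch.  A minimal sketch, nothing more:

/* ring_test.c - minimal reproducer sketch; not the original poster's code.
 * Every rank exchanges one int with its neighbours, so with one rank per
 * node the openib BTL has to establish an inter-node connection.
 * Build: mpicc ring_test.c -o ring_test
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    int send_token = rank;
    int recv_token = -1;

    /* Send our rank to the next rank and receive from the previous one. */
    MPI_Sendrecv(&send_token, 1, MPI_INT, next, 0,
                 &recv_token, 1, MPI_INT, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d of %d got token %d from rank %d\n",
           rank, size, recv_token, prev);

    MPI_Finalize();
    return 0;
}

Launching the same binary once with mpirun over ssh and once through Slurm on the same pair of nodes should show whether the openib failure really only appears on the Slurm/PMI path.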
