Check your IB setup, Michael - you probably don’t have UD enabled on it
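
A quick way to tell is to try creating a UD queue pair by hand on one of the rebooted nodes. Below is a minimal libibverbs sketch (my own illustration, not Open MPI code; it just opens the first HCA it finds, which is an assumption) that does roughly the kind of UD QP setup the udcm CPC needs on that port. If this fails on the affected nodes but works on your other clusters, the problem is below Open MPI. Build with something like "gcc ud_check.c -o ud_check -libverbs".

/* ud_check.c - illustrative sketch only: try to create an unreliable-
 * datagram (UD) queue pair on the first RDMA device.  This is only a
 * rough approximation of what Open MPI's udcm connection manager does. */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    /* assumption: first device is the mlx4_0 HCA you care about */
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { perror("ibv_open_device"); return 1; }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    if (!pd || !cq) { perror("ibv_alloc_pd/ibv_create_cq"); return 1; }

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_UD,   /* the QP type udcm relies on */
        .cap = { .max_send_wr = 1, .max_recv_wr = 1,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) {
        perror("ibv_create_qp(UD)");   /* UD QPs not usable here */
        return 1;
    }

    printf("UD QP created OK on %s\n", ibv_get_device_name(devs[0]));
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}

It is also worth diffing the ibv_devinfo output between a working node and one of the rebooted ones, since the nodes picked up updates before the problem started.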
> On Aug 25, 2016, at 11:42 AM, Michael Di Domenico <[email protected]> wrote:
>
> although i see this with and without slurm, so there very well may be
> something wrong with my ompi compile
>
> On Thu, Aug 25, 2016 at 2:04 PM, Michael Di Domenico
> <[email protected]> wrote:
>>
>> I'm seeing this presently on our new cluster.  I'm not sure what's
>> going on.  Did this ever get resolved?
>>
>> I can confirm that we have compiled openmpi with the slurm options.
>> We have other clusters which work fine, albeit this is our first
>> Mellanox-based IB cluster, so I'm not sure if that has anything to do
>> with it.  I am using the same openmpi install between clusters.
>>
>>>>>> On Apr 12, 2016, at 10:52 PM, Craig Yoshioka <[email protected]> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a strange situation that I could use assistance with.  We
>>>>>> recently rebooted some nodes in our Slurm cluster and after the reboot,
>>>>>> running MPI programs on these nodes results in complaints from OpenMPI
>>>>>> about the Infiniband ports:
>>>>>>
>>>>>> --------------------------------------------------------------------------
>>>>>> No OpenFabrics connection schemes reported that they were able to be
>>>>>> used on a specific port.  As such, the openib BTL (OpenFabrics
>>>>>> support) will be disabled for this port.
>>>>>>
>>>>>>   Local host:      XXXXXXXXXX
>>>>>>   Local device:    mlx4_0
>>>>>>   Local port:      1
>>>>>>   CPCs attempted:  udcm
>>>>>> --------------------------------------------------------------------------
>>>>>>
>>>>>> [XXXXXXXXXXX][[7024,1],1][btl_openib_proc.c:157:mca_btl_openib_proc_create]
>>>>>> [btl_openib_proc.c:157] ompi_modex_recv failed for peer [[7024,1],0]
>>>>>>
>>>>>> These nodes did receive some updates, but are otherwise all running the
>>>>>> same version of Slurm (15.08.7) and OpenMPI (1.10.2).  The weird thing is
>>>>>> that if I ssh into the affected nodes and use mpirun directly, Infiniband
>>>>>> works correctly.  So the problem definitely involves an interaction between
>>>>>> Slurm (maybe via PMI?) and OpenMPI.
