Tony, start at a low level. Is the InfiniBand fabric healthy?
Run:
  ibstatus   (on every node)
  sminfo     (on one node)
  ibdiagnet  (on one node)
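
A minimal sketch of those checks as shell functions, in case it saves some typing. It assumes passwordless ssh between the nodes and that the InfiniBand diagnostic tools (ibstatus, sminfo, ibdiagnet) are installed and on the PATH; the function names and the RSH variable are my own, not part of any tool.

```shell
#!/bin/sh
# Remote-shell command used to reach each node (assumption: passwordless ssh).
RSH="${RSH:-ssh}"

# Print the port state lines from ibstatus for every node given as an argument.
# A healthy port normally reports "state: 4: ACTIVE" and "phys state: 5: LinkUp".
check_ports() {
    for node in "$@"; do
        echo "== $node =="
        $RSH "$node" ibstatus | grep 'state:' \
            || echo "no port state reported by $node"
    done
}

# Fabric-wide checks; run these on one node only (they may need root).
check_fabric() {
    sminfo      # queries the subnet manager; an error here suggests no SM is reachable
    ibdiagnet   # scans the fabric and reports link and routing problems
}
```

For example, after sourcing the file: check_ports node01 node02 (hypothetical hostnames), then check_fabric on one of them. Any port that is not ACTIVE/LinkUp, or a failing sminfo, is worth fixing before looking at the MPI layer again.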

On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users <users@lists.open-mpi.org>
wrote:

> Hi Jeff
>
> I installed UCX as you suggested, but I can't get even the simplest code
> (ucp_client_server) to work across the network. I can compile Open MPI
> with UCX, but it has the same problem: MPI codes will not execute and
> there are no messages. Really, UCX is not helping. It adds another
> (not so well documented) software layer, which does not offer better
> diagnostics as far as I can see. It's also unclear to me how to control
> which drivers are loaded; UCX wants to make that decision for you.
> With Open MPI I can see that (for instance) the tcp module works both
> locally and over the network. It must be using the Mellanox NIC, given the
> bandwidth it reports on IMB-MPI1, even with tcp protocols. But if I
> try to use openib (or allow UCX or Open MPI to choose the transport
> layer) it just hangs. Annoyingly, I have this one server where everything
> works just fine: I can run locally over openib and it's fine. All the
> other nodes cannot seem to load openib, so even local jobs fail.
>
> The only good diagnostic (as best I can tell) is from Open MPI. ibv_obj
> (from v2.x) complains that openib returns a NULL object, whereas on my
> server it returns logical_index=1. Can we not try to diagnose the
> problem with openib not loading (see my original post for details)? I am
> pretty sure that if we can, that would fix the problem.
>
> Thanks
>
> Tony
>
> PS I tried configuring two nodes back to back to see if it was a switch
> issue, but the result was the same.
>
>
> On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:
> > Tony --
> >
> > Have you tried compiling Open MPI with UCX support?  This is Mellanox
> > (NVIDIA's) preferred mechanism for InfiniBand support these days -- the
> > openib BTL is legacy.
> >
> > You can run: mpirun --mca pml ucx ...
> >
> >
> >> On Aug 19, 2020, at 12:46 PM, Tony Ladd via users
> >> <users@lists.open-mpi.org> wrote:
> >>
> >> One other update. I compiled OpenMPI-4.0.4. The outcome was the same, but
> >> there is no mention of ibv_obj this time.
> >>
> >> Tony
> >>
> >> --
> >>
> >> Tony Ladd
> >>
> >> Chemical Engineering Department
> >> University of Florida
> >> Gainesville, Florida 32611-6005
> >> USA
> >>
> >> Email: tladd-"(AT)"-che.ufl.edu
> >> Web    http://ladd.che.ufl.edu
> >>
> >> Tel:   (352)-392-6509
> >> FAX:   (352)-392-9514
> >>
> >> <outf34-4.0><outfoam-4.0>
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
