Tony, start at a low level. Is the InfiniBand fabric healthy?
Run:
  ibstatus on every node
  sminfo on one node
  ibdiagnet on one node
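
If running ibstatus by hand on every node is tedious, something like the following might help. It is only a rough sketch: it assumes the usual ibstatus layout (an "Infiniband device '...' port N status:" header followed by "state:" and "phys state:" lines) and passwordless ssh to each node. IB_NODES and ib_port_problems are names made up for this example, not part of any tool.

```shell
#!/bin/sh
# Sketch: report InfiniBand ports that are not ACTIVE / LinkUp.
# Assumes the usual ibstatus output layout; adjust the patterns if yours differs.

ib_port_problems() {
    # $1 = label (hostname) used to prefix each report line.
    # ibstatus output is read from stdin.
    awk -v host="$1" '
        /^Infiniband device/ { dev = $3; port = $5 }
        /^[ \t]*state:/      { if ($0 !~ /ACTIVE/) print host, dev, "port", port, "state not ACTIVE" }
        /^[ \t]*phys state:/ { if ($0 !~ /LinkUp/) print host, dev, "port", port, "phys state not LinkUp" }
    '
}

# IB_NODES is a hypothetical variable holding a space-separated host list,
# e.g.  IB_NODES="node01 node02" sh ibcheck.sh
# With IB_NODES unset the loop does nothing.
for node in ${IB_NODES:-}; do
    ssh "$node" ibstatus | ib_port_problems "$node"
done
```

No output means every port it saw was ACTIVE and LinkUp. A node that fails to answer over ssh also prints nothing on stdout, so watch stderr. sminfo and ibdiagnet only need to be run once, on any one node, so they are left out of the loop.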

On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users <users@lists.open-mpi.org> wrote:

> Hi Jeff
>
> I installed ucx as you suggested. But I can't get even the simplest code
> (ucp_client_server) to work across the network. I can compile openMPI
> with UCX but it has the same problem - mpi codes will not execute and
> there are no messages. Really, UCX is not helping. It is adding another
> (not so well documented) software layer, which does not offer better
> diagnostics as far as I can see. It's also unclear to me how to control
> what drivers are being loaded - UCX wants to make that decision for you.
> With openMPI I can see that (for instance) the tcp module works both
> locally and over the network - it must be using the Mellanox NIC for the
> bandwidth it is reporting on IMB-MPI1 even with tcp protocols. But if I
> try to use openib (or allow ucx or openmpi to choose the transport
> layer) it just hangs. Annoyingly I have this server where everything
> works just fine - I can run locally over openib and it's fine. All the
> other nodes cannot seem to load openib so even local jobs fail.
>
> The only good (as best I can tell) diagnostic is from openMPI. ibv_obj
> (from v2.x) complains that openib returns a NULL object, whereas on my
> server it returns logical_index=1. Can we not try to diagnose the
> problem with openib not loading (see my original post for details)? I am
> pretty sure if we can that would fix the problem.
>
> Thanks
>
> Tony
>
> PS I tried configuring two nodes back to back to see if it was a switch
> issue, but the result was the same.
>
>
> On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:
> > [External Email]
> >
> > Tony --
> >
> > Have you tried compiling Open MPI with UCX support? This is Mellanox
> > (NVIDIA's) preferred mechanism for InfiniBand support these days -- the
> > openib BTL is legacy.
> >
> > You can run: mpirun --mca pml ucx ...
> >
> >
> >> On Aug 19, 2020, at 12:46 PM, Tony Ladd via users <users@lists.open-mpi.org> wrote:
> >>
> >> One other update. I compiled OpenMPI-4.0.4. The outcome was the same but
> >> there is no mention of ibv_obj this time.
> >>
> >> Tony
> >>
> >> --
> >>
> >> Tony Ladd
> >>
> >> Chemical Engineering Department
> >> University of Florida
> >> Gainesville, Florida 32611-6005
> >> USA
> >>
> >> Email: tladd-"(AT)"-che.ufl.edu
> >> Web http://ladd.che.ufl.edu
> >>
> >> Tel: (352)-392-6509
> >> FAX: (352)-392-9514
> >>
> >> <outf34-4.0><outfoam-4.0>
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
>
>
> --
> Tony Ladd
>
> Chemical Engineering Department
> University of Florida
> Gainesville, Florida 32611-6005
> USA
>
> Email: tladd-"(AT)"-che.ufl.edu
> Web http://ladd.che.ufl.edu
>
> Tel: (352)-392-6509
> FAX: (352)-392-9514