Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-23 Thread Tony Ladd via users

Hi John

Thanks for the response. I have run all those diagnostics, and as best I 
can tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients + 
server) and the fabric passes all the tests. There is 1 warning:


-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps

but according to a number of sources this is harmless.

I have run Mellanox's P2P performance tests (ib_write_bw) between
different pairs of nodes, and it reports 3.22 GB/sec, which is reasonable
(it's a PCIe 2 x8 interface, i.e. 4 GB/s). I have also configured 2 nodes
back to back to check that the switch is not the problem - it makes no
difference.
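
As a rough check on the "4 GB/s" ceiling mentioned above, the arithmetic can be sketched as follows. The 5 GT/s per-lane rate and the 8b/10b encoding overhead are the standard PCIe 2.0 figures and are assumptions added here, not values taken from the post; the 3.22 GB/s measurement is the ib_write_bw result quoted above.

```python
# Back-of-the-envelope check of the "PCIe 2 x8 = 4 GB/s" figure.
# Assumptions (not from the original post): PCIe 2.0 signals at 5 GT/s per
# lane and uses 8b/10b encoding, so only 8 of every 10 bits carry payload.

GT_PER_S_PER_LANE = 5.0        # PCIe 2.0 raw signalling rate per lane
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line coding
LANES = 8                      # x8 interface, as stated in the post


def pcie2_peak_gbytes_per_s(lanes: int) -> float:
    """Theoretical one-direction payload bandwidth in GB/s."""
    gbits = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes
    return gbits / 8  # bits -> bytes


peak = pcie2_peak_gbytes_per_s(LANES)
measured = 3.22  # GB/s reported by ib_write_bw in the post
efficiency = measured / peak

print(f"peak {peak:.2f} GB/s, measured {measured:.2f} GB/s, "
      f"{efficiency:.1%} of peak")
```

A measured rate at roughly four fifths of the theoretical peak is consistent with the claim that the link itself is performing reasonably.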


I have been playing with the btl params with openMPI (v. 2.1.1, which is
what is released in Ubuntu 18.04). With tcp as the transport layer
everything works fine - 1 node or 2 node communication - I have tested
up to 16 processes (8+8) and it seems fine. Of course the latency is
much higher on the tcp interface, so I would still like to access the
RDMA layer. But unless I exclude the openib module, it always hangs.
Same with OpenMPI v4 compiled from source.
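
For reference, the transport choices described above map onto Open MPI MCA parameters roughly as in the sketch below. It only builds and prints the command lines; it does not run anything. The process count and the program name `./a.out` are hypothetical placeholders, and the `force_openib` variant with `btl_base_verbose` is one possible way to get more output from the hanging case, not something reported in the post. The `pml ucx` form is the one suggested elsewhere in this thread.

```python
import shlex

# Hypothetical placeholders: adjust the process count and binary to your job.
NPROCS = 16
PROGRAM = "./a.out"

# Transport selections discussed in the thread, as MCA parameter arguments.
TRANSPORTS = {
    # TCP only (plus the "self" loopback component): the case that works.
    "tcp_only": ["--mca", "btl", "tcp,self"],
    # Let Open MPI choose, but never use openib.
    "exclude_openib": ["--mca", "btl", "^openib"],
    # Force openib and raise BTL verbosity to see why it fails to load.
    "force_openib": ["--mca", "btl", "openib,self",
                     "--mca", "btl_base_verbose", "100"],
    # UCX point-to-point layer.
    "ucx": ["--mca", "pml", "ucx"],
}


def mpirun_command(transport, nprocs=NPROCS, program=PROGRAM):
    """Return the mpirun argument list for one of the named transports."""
    return ["mpirun", "-np", str(nprocs), *TRANSPORTS[transport], program]


for name in TRANSPORTS:
    # shlex.join quotes "^openib" so the line can be pasted into a shell.
    print(f"{name:15s} {shlex.join(mpirun_command(name))}")
```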


I think an important factor is that Mellanox has not supported
ConnectX-2 for some time. This is really infuriating; a $500 network
card with no supported drivers, but that is business for you I suppose.
I have 50 NICs and I can't afford to replace them all. The other
factor is that MLNX-OFED is tied to specific software versions, so I
can't just run an older set of drivers. I have not seen source files for
the Mellanox drivers - I would take a crack at compiling them if I did.
In the past I have used the OFED drivers (on Centos 5) with no problem,
but I don't think this is an option now.


Ubuntu claims to support ConnectX-2 with their drivers (Mellanox
confirms this), but of course this is community support and the number
of cases is obviously small. I use the Ubuntu drivers right now because
the OFED install seems broken and there is no help with it. It's not
supported! Neat huh?


The only handle I have is with openmpi v. 2 when there is a message (see 
my original post) that ibv_obj returns a NULL result. But I don't 
understand the significance of the message (if any).


I am not enthused about UCX - the documentation has several obvious
typos in it, which is not encouraging when you are floundering. I know
it's a newish project but I have used openib for 10+ years and it's never
had a problem until now. I think this is not so much openib as the
software below it. One other thing I should say is that if I run any
recent version of mstflint it always complains:


Failed to identify the device - Can not create SignatureManager!

Going back to my original OFED 1.5 this did not happen, but they are at 
v5 now.


Everything else works as far as I can see. But I could not burn new
firmware except by going back to the 1.5 OS. Perhaps this is connected
with the ibv_obj = NULL result.


Thanks for helping out. As you can see I am rather stuck.

Best

Tony

On 8/23/20 3:01 AM, John Hearns via users wrote:

Tony, start at a low level. Is the Infiniband fabric healthy?
Run
ibstatus   on every node
sminfo on one node
ibdiagnet on one node
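
The checklist above can be turned into a small command plan, as in this sketch. The node names are hypothetical (the thread only says there are 48 clients plus a server), and the use of passwordless ssh to reach each node is an assumption. The script only prints the commands it would run.

```python
import shlex

# Hypothetical node names; the thread describes 48 clients plus one server.
NODES = ["server"] + [f"node{i:02d}" for i in range(1, 49)]


def diagnostic_plan(nodes):
    """Return (host, command) pairs: ibstatus on every node,
    sminfo and ibdiagnet on a single node (the first in the list)."""
    plan = [(host, "ibstatus") for host in nodes]
    first = nodes[0]
    plan.append((first, "sminfo"))
    plan.append((first, "ibdiagnet"))
    return plan


def as_ssh(host, command):
    # Assumes passwordless ssh to each node; this only builds the command line.
    return shlex.join(["ssh", host, command])


for host, command in diagnostic_plan(NODES):
    print(as_ssh(host, command))
```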

On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users
<users@lists.open-mpi.org> wrote:


Hi Jeff

I installed ucx as you suggested. But I can't get even the simplest code
(ucp_client_server) to work across the network. I can compile openMPI
with UCX but it has the same problem - mpi codes will not execute and
there are no messages. Really, UCX is not helping. It is adding another
(not so well documented) software layer, which does not offer better
diagnostics as far as I can see. It's also unclear to me how to control
what drivers are being loaded - UCX wants to make that decision for you.
With openMPI I can see that (for instance) the tcp module works both
locally and over the network - it must be using the Mellanox NIC for the
bandwidth it is reporting on IMB-MPI1 even with tcp protocols. But if I
try to use openib (or allow ucx or openmpi to choose the transport
layer) it just hangs. Annoyingly I have this server where everything
works just fine - I can run locally over openib and it's fine. All the
other nodes cannot seem to load openib so even local jobs fail.

The only good (as best I can tell) diagnostic is from openMPI. ibv_obj
(from v2.x) complains that openib returns a NULL object, whereas on my
server it returns logical_index=1. Can we not try to diagnose the
problem with openib not loading (see my original post for details)? I am
pretty sure if we can that would fix the problem.

Thanks

Tony

PS I tried configuring two nodes back to back to see if it was a switch
issue, but the result was the same.


Re: [OMPI users] Problem in starting openmpi job - no output just hangs

2020-08-23 Thread John Hearns via users
Tony, start at a low level. Is the Infiniband fabric healthy?
Run
ibstatus   on every node
sminfo on one node
ibdiagnet on one node

On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users wrote:

> Hi Jeff
>
> I installed ucx as you suggested. But I can't get even the simplest code
> (ucp_client_server) to work across the network. I can compile openMPI
> with UCX but it has the same problem - mpi codes will not execute and
> there are no messages. Really, UCX is not helping. It is adding another
> (not so well documented) software layer, which does not offer better
> diagnostics as far as I can see. It's also unclear to me how to control
> what drivers are being loaded - UCX wants to make that decision for you.
> With openMPI I can see that (for instance) the tcp module works both
> locally and over the network - it must be using the Mellanox NIC for the
> bandwidth it is reporting on IMB-MPI1 even with tcp protocols. But if I
> try to use openib (or allow ucx or openmpi to choose the transport
> layer) it just hangs. Annoyingly I have this server where everything
> works just fine - I can run locally over openib and it's fine. All the
> other nodes cannot seem to load openib so even local jobs fail.
>
> The only good (as best I can tell) diagnostic is from openMPI. ibv_obj
> (from v2.x) complains  that openib returns a NULL object, whereas on my
> server it returns logical_index=1. Can we not try to diagnose the
> problem with openib not loading (see my original post for details)? I am
> pretty sure if we can that would fix the problem.
>
> Thanks
>
> Tony
>
> PS I tried configuring two nodes back to back to see if it was a switch
> issue, but the result was the same.
>
>
> On 8/19/20 1:27 PM, Jeff Squyres (jsquyres) wrote:
> > Tony --
> >
> > Have you tried compiling Open MPI with UCX support?  This is Mellanox
> > (NVIDIA's) preferred mechanism for InfiniBand support these days -- the
> > openib BTL is legacy.
> >
> > You can run: mpirun --mca pml ucx ...
> >
> >
> >> On Aug 19, 2020, at 12:46 PM, Tony Ladd via users <users@lists.open-mpi.org> wrote:
> >>
> >> One other update. I compiled OpenMPI-4.0.4. The outcome was the same but
> >> there is no mention of ibv_obj this time.
> >>
> >> Tony
> >>
> >> --
> >>
> >> Tony Ladd
> >>
> >> Chemical Engineering Department
> >> University of Florida
> >> Gainesville, Florida 32611-6005
> >> USA
> >>
> >> Email: tladd-"(AT)"-che.ufl.edu
> >> Web:   http://ladd.che.ufl.edu
> >>
> >> Tel:   (352)-392-6509
> >> FAX:   (352)-392-9514
> >>
> >> 
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> --
> Tony Ladd
>
> Chemical Engineering Department
> University of Florida
> Gainesville, Florida 32611-6005
> USA
>
> Email: tladd-"(AT)"-che.ufl.edu
> Web:   http://ladd.che.ufl.edu
>
> Tel:   (352)-392-6509
> FAX:   (352)-392-9514
>
>