Hi John
Thanks for the response. I have run all those diagnostics, and as best I
can tell the IB fabric is OK. I have a cluster of 49 nodes (48 clients +
1 server) and the fabric passes all the tests. There is one warning:
I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:40Gbps > group-rate:10Gbps
but according to a number of sources this is harmless.
I have run Mellanox's point-to-point performance test (ib_write_bw)
between different pairs of nodes and it reports 3.22 GB/s, which is
reasonable (it's a PCIe 2.0 x8 interface, i.e. 4 GB/s peak). I have also
cabled two nodes back to back to check that the switch is not the
problem - it makes no difference.
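For reference, this is the usual perftest invocation I used (the
hostname is a placeholder):

    ib_write_bw              # on the first node, which acts as the server
    ib_write_bw node02       # on the second node, pointing at the first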
I have been playing with the btl params with OpenMPI (v2.1.1, which is
what is released in Ubuntu 18.04). With tcp as the transport layer
everything works fine - one-node or two-node communication - I have
tested up to 16 processes (8+8) and it seems fine. Of course the latency
is much higher on the tcp interface, so I would still like to access the
RDMA layer. But unless I exclude the openib module, it always hangs (see
the example invocations below). Same with OpenMPI v4 compiled from source.
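To be concrete, the runs look something like this (the hostfile and
binary are just examples; "hosts" lists node01 and node02 with slots=8
each):

    # works: restrict to tcp + shared memory, or equivalently exclude openib
    mpirun --mca btl tcp,self,vader -np 16 --hostfile hosts ./IMB-MPI1 PingPong
    mpirun --mca btl ^openib -np 16 --hostfile hosts ./IMB-MPI1 PingPong
    # hangs: allow openib to be selected
    mpirun --mca btl openib,self,vader -np 16 --hostfile hosts ./IMB-MPI1 PingPong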
I think an important component is that Mellanox has not supported the
ConnectX-2 for some time. This is really infuriating; a $500 network
card with no supported drivers, but that is business for you I suppose.
I have 50 NICs and I can't afford to replace them all. The other
component is that MLNX-OFED is tied to specific software versions, so I
can't just run an older set of drivers. I have not seen source files for
the Mellanox drivers - I would take a crack at compiling them if I did.
In the past I have used the OFED drivers (on CentOS 5) with no problem,
but I don't think this is an option now.
Ubuntu claims to support the ConnectX-2 with their drivers (Mellanox
confirms this), but of course this is community support and the number
of cases is obviously small. I use the Ubuntu drivers right now because
the OFED install seems broken and there is no help with it. It's not
supported! Neat, huh?
The only handle I have is that with OpenMPI v2 there is a message (see
my original post) that ibv_obj returns a NULL result. But I don't
understand the significance of the message (if any).
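For what it's worth, these are the verbs-level checks I would compare
between my working server and a bad node (mlx4_0 is the device name I'd
expect for a ConnectX-2):

    ibv_devices        # should list the HCA (e.g. mlx4_0)
    ibv_devinfo        # port state should be PORT_ACTIVE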
I am not enthused about UCX - the documentation has several obvious
typos in it, which is not encouraging when you are floundering. I know
it's a newish project, but I have used openib for 10+ years and it has
never had a problem until now. I think this is not so much openib as the
software below it. One other thing I should say is that if I run any
recent version of mstflint it always complains:
Failed to identify the device - Can not create SignatureManager!
Going back to my original OFED 1.5 this did not happen, but OFED is at
v5 now.
Everything else works as far as I can see, but I could not burn new
firmware except by going back to the 1.5 OS. Perhaps this is connected
with the ibv_obj = NULL result.
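For the record, the complaint appears with a plain query (the PCI
address here is just an example; lspci gives the real one):

    lspci | grep Mellanox        # find the HCA's PCI address
    mstflint -d 04:00.0 query    # recent versions fail with the SignatureManager error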
Thanks for helping out. As you can see I am rather stuck.
Best
Tony
On 8/23/20 3:01 AM, John Hearns via users wrote:
Tony, start at a low level. Is the InfiniBand fabric healthy?
Run
ibstatus on every node
sminfo on one node
ibdiagnet on one node
On Sun, 23 Aug 2020 at 05:02, Tony Ladd via users
<users@lists.open-mpi.org> wrote:
Hi Jeff
I installed UCX as you suggested. But I can't get even the simplest code
(ucp_client_server) to work across the network. I can compile OpenMPI
with UCX but it has the same problem - MPI codes will not execute and
there are no messages. Really, UCX is not helping. It is adding another
(not so well documented) software layer, which does not offer better
diagnostics as far as I can see. It's also unclear to me how to control
what drivers are being loaded - UCX wants to make that decision for you.
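The only lever I have found is environment variables; my reading of the
docs (not verified on my setup) is that something like this restricts
the transports UCX will consider and turns up the logging:

    # server side
    UCX_TLS=tcp UCX_LOG_LEVEL=debug ./ucp_client_server
    # client side (replace <server-ip> with the server's address)
    UCX_TLS=tcp UCX_LOG_LEVEL=debug ./ucp_client_server -a <server-ip>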
With OpenMPI I can see that (for instance) the tcp module works both
locally and over the network - it must be using the Mellanox NIC, given
the bandwidth it is reporting in IMB-MPI1 even with tcp protocols. But
if I try to use openib (or allow UCX or OpenMPI to choose the transport
layer) it just hangs. Annoyingly, I have this one server where
everything works just fine - I can run locally over openib and it's
fine. All the other nodes cannot seem to load openib, so even local
jobs fail.
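For instance, a run of this sort works over tcp (hostnames are
placeholders):

    mpirun --mca btl tcp,self -np 2 --host node01,node02 ./IMB-MPI1 PingPong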
The only good (as best I can tell) diagnostic is from OpenMPI: ibv_obj
(from v2.x) complains that openib returns a NULL object, whereas on my
server it returns logical_index=1. Can we not try to diagnose the
problem with openib not loading (see my original post for details)? I
am pretty sure that if we can, that would fix the problem.
Thanks
Tony
PS I tried configuring two nodes back to back to see if it was a switch
issue, but the result was the same.