On Aug 24, 2020, at 9:44 PM, Tony Ladd <tl...@che.ufl.edu> wrote:
> 
> I appreciate your help (and John's as well). At this point I don't think this 
> is an OMPI problem - my mistake. I think the communication with RDMA is 
> somehow disabled (perhaps it's the verbs layer - I am not very knowledgeable 
> about this). It used to work like a dream, but Mellanox has apparently 
> disabled some of the ConnectX-2 components, because neither ompi nor ucx 
> (with/without ompi) could connect with the RDMA layer. Some of the InfiniBand 
> functions are also not working on the X2 (mstflint, mstconfig).

If the IB stack itself is not functioning, then you're right: Open MPI won't 
work, either (with openib or UCX).

You can keep poking at it with low-level diagnostic tools such as ibv_devinfo 
and ibv_rc_pingpong. If those don't work, Open MPI won't work over IB, either.
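
As a rough illustration of that first check, here is a minimal Python sketch that runs ibv_devinfo and lists the ports reporting PORT_ACTIVE. The helper names (parse_port_states, active_ports) are made up for this example, and the parser assumes the usual non-verbose ibv_devinfo layout of "hca_id:", "port:" and "state:" lines, so adjust it if your output looks different.

```python
import re
import shutil
import subprocess


def parse_port_states(devinfo_text):
    """Map each hca_id to a list of (port_number, state) tuples.

    Assumes the common ibv_devinfo text layout (hca_id / port / state lines).
    """
    devices = {}
    current = None  # hca_id currently being parsed
    port = None     # port number waiting for its state line
    for line in devinfo_text.splitlines():
        stripped = line.strip()
        m = re.match(r"hca_id:\s*(\S+)", stripped)
        if m:
            current = m.group(1)
            devices[current] = []
            port = None
            continue
        # re.match anchors at the start, so "phys_port_cnt:" is not matched here.
        m = re.match(r"port:\s*(\d+)", stripped)
        if m and current is not None:
            port = int(m.group(1))
            continue
        m = re.match(r"state:\s*(\S+)", stripped)
        if m and current is not None and port is not None:
            devices[current].append((port, m.group(1)))
            port = None
    return devices


def active_ports(devinfo_text):
    """Return (hca_id, port) pairs whose state is PORT_ACTIVE."""
    return [
        (dev, port)
        for dev, ports in parse_port_states(devinfo_text).items()
        for port, state in ports
        if state == "PORT_ACTIVE"
    ]


if __name__ == "__main__":
    if shutil.which("ibv_devinfo") is None:
        print("ibv_devinfo not found in PATH")
    else:
        result = subprocess.run(["ibv_devinfo"], capture_output=True, text=True)
        print(active_ports(result.stdout) or "no active ports reported")
```

If this reports no active ports, or ibv_devinfo itself fails, the problem is below Open MPI.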

> In fact ompi always tries to access the openib module. I have to explicitly 
> disable it even to run on 1 node.

Yes, that makes sense: Open MPI will aggressively try to use every possible 
mechanism.
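
For reference, excluding the openib BTL is done with the "--mca btl ^openib" option to mpirun. A small sketch of building that command line from Python follows; the function name and the "./my_app" program are hypothetical placeholders, not anything from your setup.

```python
import shlex


def mpirun_without_openib(program, nprocs=1, extra_args=()):
    """Build an mpirun argv that excludes the openib BTL.

    "^openib" tells Open MPI to consider every BTL except openib.
    """
    return ["mpirun", "--mca", "btl", "^openib",
            "-np", str(nprocs), program, *extra_args]


if __name__ == "__main__":
    # "./my_app" is a placeholder for the real MPI executable.
    print(shlex.join(mpirun_without_openib("./my_app", nprocs=4)))
```

Passing the list straight to subprocess.run avoids any shell quoting issues with the "^" character.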

> So I think it is in initialization not communication that the problem lies.

I'm not sure that's correct.

From your initial emails, it looks like openib thinks it initialized properly.

> This is why (I think) ibv_obj returns NULL.

I'm not sure if that's a problem or not.  That section of output is where Open 
MPI is measuring the distance from the current process to the PCI bus where the 
device lives.  I don't remember offhand if returning NULL in that area is 
actually a problem or just an indication of some kind of non-error condition.

Specifically: if returning NULL there was a problem, we *probably* would have 
aborted at that point.  I have not looked at the code to verify that, though.

> The better news is that with the tcp stack everything works fine (ompi, ucx, 
> 1 node, many nodes) - the bandwidth is similar to rdma, so for large messages 
> it's semi OK. It's a partial solution - not all I wanted, of course. The 
> direct rdma functions ib_read_lat etc. also work fine with expected results. 
> I am suspicious that this disabling of the driver is a commercial more than a 
> technical decision.
> I am going to try going back to Ubuntu 16.04 - there is a version of OFED 
> that still supports the X2. But I think it may still get messed up by kernel 
> upgrades (it does for 18.04, I found). So it's not an easy path.


I can't speak for Nvidia here, sorry.

-- 
Jeff Squyres
jsquy...@cisco.com
