On Aug 24, 2020, at 9:44 PM, Tony Ladd <tl...@che.ufl.edu> wrote:
>
> I appreciate your help (and John's as well). At this point I don't think this is
> an OMPI problem - my mistake. I think the communication with RDMA is somehow
> disabled (perhaps it's the verbs layer - I am not very knowledgeable with
> this). It used to work like a dream but Mellanox has apparently disabled some
> of the ConnectX-2 components, because neither ompi nor ucx (with/without ompi)
> could connect with the RDMA layer. Some of the InfiniBand functions are also
> not working on the X2 (mstflint, mstconfig).

If the IB stack itself is not functioning, then you're right: Open MPI
won't work over IB, either (with openib or UCX). You can keep poking at
it with low-level diagnostic tools like ibv_devinfo and
ibv_rc_pingpong. Until those work, Open MPI won't work over IB.

> In fact ompi always tries to access the openib module. I have to explicitly
> disable it even to run on 1 node.

Yes, that makes sense: Open MPI will aggressively try to use every
possible mechanism.

> So I think it is in initialization not communication that the problem lies.

I'm not sure that's correct. From your initial emails, it looks like
openib thinks it initialized properly.

> This is why (I think) ibv_obj returns NULL.

I'm not sure if that's a problem or not. That section of output is
where Open MPI is measuring the distance from the current process to
the PCI bus where the device lives. I don't remember offhand whether
returning NULL in that area is actually a problem or just an indication
of some kind of non-error condition.

Specifically: if returning NULL there were a problem, we *probably*
would have aborted at that point. I have not looked at the code to
verify that, though.

> The better news is that with the tcp stack everything works fine (ompi, ucx,
> 1 node, many nodes) - the bandwidth is similar to rdma so for large messages
> it's semi OK. It's a partial solution - not all I wanted of course. The direct
> rdma functions ib_read_lat etc also work fine with expected results. I am
> suspicious this disabling of the driver is a commercial more than a technical
> decision.
> I am going to try going back to Ubuntu 16.04 - there is a version of OFED
> that still supports the X2. But I think it may still get messed up by kernel
> upgrades (it does for 18.04 I found). So it's not an easy path.

I can't speak for Nvidia here, sorry.

-- 
Jeff Squyres
jsquy...@cisco.com