Jeff

I found the solution: RDMA needs to register (pin) a significant amount of memory, so the locked-memory limits on the shell have to be increased. I needed to add the lines

* soft memlock unlimited
* hard memlock unlimited

to the end of the file /etc/security/limits.conf. After that the openib driver loads and everything is fine - proper IB latency again.
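
To verify that the new limit is actually in effect for the shell that launches the MPI jobs (a quick check, assuming the jobs are started from a fresh ssh login):

  ulimit -l    # should report "unlimited" after logging in again

Note that services started by systemd do not read limits.conf, so a daemon that launches the MPI processes (a batch scheduler, for example) may need LimitMEMLOCK=infinity set in its unit file instead.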

I see that # 16 of the tuning FAQ discusses the same issue, but in my case there was no error or warning message. I am posting this in case anyone else runs into this issue.

The Mellanox OFED install adds those lines automatically, so I had not run into this before.

Tony


On 8/25/20 10:42 AM, Jeff Squyres (jsquyres) wrote:

On Aug 24, 2020, at 9:44 PM, Tony Ladd <tl...@che.ufl.edu> wrote:
> I appreciate your help (and John's as well). At this point I don't think this is
> an OMPI problem; my mistake. I think the communication with RDMA is somehow
> disabled (perhaps it's the verbs layer; I am not very knowledgeable about this).
> It used to work like a dream, but Mellanox has apparently disabled some of the
> ConnectX-2 components, because neither OMPI nor UCX (with or without OMPI) could
> connect with the RDMA layer. Some of the InfiniBand tools are also not
> working on the X2 (mstflint, mstconfig).

If the IB stack itself is not functioning, then you're right: Open MPI won't 
work, either (with openib or UCX).

You can try to keep poking with the low-layer diagnostic tools like ibv_devinfo 
and ibv_rc_pingpong.  If those don't work, Open MPI won't work over IB, either.
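
For example, a minimal low-level check between two nodes could look like the following (the device name and hostname are placeholders; use the device reported by ibv_devinfo):

  ibv_devinfo                              # on both nodes: HCA present, port state ACTIVE?
  ibv_rc_pingpong -d mlx4_0                # on the server node
  ibv_rc_pingpong -d mlx4_0 <server-host>  # on the client node

If the pingpong completes and reports reasonable numbers, the verbs layer itself is usable.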

> In fact OMPI always tries to access the openib module. I have to explicitly
> disable it even to run on one node.

Yes, that makes sense: Open MPI will aggressively try to use every possible 
mechanism.
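
For what it's worth, excluding the openib BTL on the command line can be done along these lines (the hostnames, process count, and benchmark binary are just placeholders):

  mpirun --mca btl ^openib -np 2 -host node1,node2 ./osu_latency

or persistently by putting "btl = ^openib" in $HOME/.openmpi/mca-params.conf.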

> So I think it is in initialization, not communication, that the problem lies.

I'm not sure that's correct.

From your initial emails, it looks like openib thinks it initialized properly.

> This is why (I think) ibv_obj returns NULL.

I'm not sure if that's a problem or not.  That section of output is where Open 
MPI is measuring the distance from the current process to the PCI bus where the 
device lives.  I don't remember offhand if returning NULL in that area is 
actually a problem or just an indication of some kind of non-error condition.

Specifically: if returning NULL there was a problem, we *probably* would have 
aborted at that point.  I have not looked at the code to verify that, though.

> The better news is that with the TCP stack everything works fine (OMPI, UCX, one
> node, many nodes), and the bandwidth is similar to RDMA, so for large messages it's
> semi-OK. It's a partial solution, not all I wanted of course. The direct RDMA
> functions (ib_read_lat etc.) also work fine, with the expected results. I suspect
> this disabling of the driver is a commercial more than a technical decision.
> I am going to try going back to Ubuntu 16.04; there is a version of OFED that
> still supports the X2. But I think it may still get messed up by kernel
> upgrades (it does for 18.04, I found). So it's not an easy path.

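For reference, the perftest check mentioned above can be run between two nodes roughly like this (device name and hostname are placeholders):

  ib_read_lat -d mlx4_0                # on the server node
  ib_read_lat -d mlx4_0 <server-host>  # on the client node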

I can't speak for Nvidia here, sorry.

--
Jeff Squyres
jsquy...@cisco.com

--
Tony Ladd

Chemical Engineering Department
University of Florida
Gainesville, Florida 32611-6005
USA

Email: tladd-"(AT)"-che.ufl.edu
Web    http://ladd.che.ufl.edu

Tel:   (352)-392-6509
FAX:   (352)-392-9514
