Sure - but then we aren't talking about containers any more, just vendor vs 
OMPI. I'm not getting in the middle of that one!

On Jan 27, 2022, at 6:28 PM, Gilles Gouaillardet via users wrote:

Thanks Ralph,

Now I get what you had in mind.

Strictly speaking, you are making the assumption that Open MPI's performance 
matches that of the system MPI.

This is generally true for common interconnects and/or those that feature 
providers for libfabric or UCX, but not so for "exotic" interconnects (that 
might not be supported natively by Open MPI or abstraction layers) and/or with 
an uncommon topology (for which collective communications are not fully 
optimized by Open MPI). In those cases, using the system/vendor MPI is the 
best option performance-wise.



On Fri, Jan 28, 2022 at 2:23 AM Ralph Castain via users wrote:
Just to complete this - there is always a lingering question regarding shared 
memory support. There are two ways to resolve that one:

* run one container per physical node, launching multiple procs in each 
container. The procs can then utilize shared memory _inside_ the container. 
This is the cleanest solution (i.e., minimizes container boundary violations), 
but some users need/want per-process isolation.

* run one container per MPI process, having each container then mount an 
_external_ common directory to an internal mount point. This allows each 
process to access the common shared memory location. As with the device 
drivers, you typically specify that external mount location when launching the 
container.

Using those combined methods, you can certainly have a "generic" container that 
suffers no performance penalty relative to bare metal. The problem has been that it 
takes a certain degree of "container savvy" to set this up and make it work - 
which is beyond what most users really want to learn. I'm sure the container 
community is working on ways to reduce that burden (I'm not really plugged into 
those efforts, but others on this list might be).
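
For the second shared-memory option above (one container per MPI process, each 
mounting a common external directory), here is a minimal sketch of how the launch 
command for one rank might be assembled. It assumes a Docker-style runtime that 
accepts "-v HOST:CONTAINER" bind mounts; the function name, image name, and both 
directory paths are hypothetical, and how Open MPI is told to use that directory 
is not shown.

```python
def per_rank_launch_cmd(image, app_argv, host_shm_dir, container_shm_dir,
                        runtime="docker"):
    """Build the argv for launching ONE MPI process in its own container.

    Every rank on a node is given the same host_shm_dir, so all of the
    per-process containers see one common directory at container_shm_dir.
    All names and paths are illustrative assumptions, not Open MPI settings.
    """
    bind = f"{host_shm_dir}:{container_shm_dir}"
    return [runtime, "run", "--rm", "-v", bind, image, *app_argv]


if __name__ == "__main__":
    # Hypothetical image and paths, for illustration only.
    cmd = per_rank_launch_cmd("my-ompi-image", ["./my_mpi_app"],
                              "/dev/shm/job42", "/shared")
    print(" ".join(cmd))
```

The same argv would be generated once per rank; only the application arguments 
or rank-specific environment would differ between containers on a node.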


> On Jan 27, 2022, at 7:39 AM, Ralph H Castain wrote:
>> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" 
>> use case, my bad for not making my assumption clear.
>> If the container is built to run on a specific host, there are indeed other 
>> options to achieve near-native performance.
> Err...that isn't actually what I meant, nor what we did. You can, in fact, 
> build a container that can "run everywhere" while still employing high-speed 
> fabric support. What you do is:
> * configure OMPI with all the fabrics enabled (or at least all the ones you 
> care about)
> * don't include the fabric drivers in your container. These can/will vary 
> across deployments, especially those (like NVIDIA's) that involve kernel 
> modules
> * set up your container to mount specified external device driver locations 
> onto the locations where you configured OMPI to find them. Sadly, this does 
> violate the container boundary - but nobody has come up with another 
> solution, and at least the violation is confined to just the device drivers. 
> Typically, you specify the external locations that are to be mounted using an 
> envar or some other mechanism appropriate to your container, and then include 
> the relevant information when launching the containers.
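
The envar-driven mounting described in the last step above could look roughly 
like the following sketch. It assumes a Docker-style "-v HOST:CONTAINER:ro" 
bind-mount flag; the variable name FABRIC_DRIVER_MOUNTS and its 
"host:container,host:container" format are hypothetical, chosen only for 
illustration, and the paths must match wherever OMPI was configured to look.

```python
import os


def driver_mount_flags(env=None, var="FABRIC_DRIVER_MOUNTS"):
    """Translate an envar listing driver locations into bind-mount flags.

    The envar name and its comma-separated 'host:container' format are
    assumptions made for this sketch. Each pair becomes a read-only mount.
    """
    env = os.environ if env is None else env
    flags = []
    for spec in env.get(var, "").split(","):
        spec = spec.strip()
        if not spec:
            continue  # tolerate an unset variable or a trailing comma
        host, sep, inside = spec.partition(":")
        if not sep or not host or not inside:
            raise ValueError(f"expected HOST:CONTAINER, got {spec!r}")
        flags += ["-v", f"{host}:{inside}:ro"]
    return flags


if __name__ == "__main__":
    # Hypothetical host and container paths.
    example = {"FABRIC_DRIVER_MOUNTS": "/opt/vendor/lib:/opt/fabric/lib"}
    print(driver_mount_flags(example))
```

The returned flags would be spliced into the container launch command, so the 
image itself stays free of site-specific drivers.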

> When OMPI initializes, it will do its normal procedure of attempting to load 
> each fabric's drivers, selecting the transports whose drivers it can load. 
> NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build 
> without statically linking in the fabric plugins or else this probably will 
> fail.
> At least one vendor now distributes OMPI containers preconfigured with their 
> fabric support based on this method. So using a "generic" container doesn't 
> mean you lose performance - in fact, our tests showed zero impact on 
> performance using this method.
> Ralph
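
As a rough illustration of the "try to load each fabric's drivers and keep the 
ones that load" behavior described above, the sketch below probes a list of 
shared libraries with Python's ctypes. This is not how OMPI is implemented (it 
is written in C and uses its own component framework); the library names in 
the example are only plausible candidates and may differ on a given system.

```python
import ctypes


def loadable_libraries(candidates):
    """Return the subset of candidate shared libraries that can be loaded.

    Mimics, very loosely, a 'probe and select' pass: a library that is
    missing (for example, because the driver directory was not mounted
    into the container) is simply skipped rather than treated as fatal.
    """
    usable = []
    for name in candidates:
        try:
            ctypes.CDLL(name)
        except OSError:
            continue  # not present or not loadable in this environment
        usable.append(name)
    return usable


if __name__ == "__main__":
    # Example sonames only; adjust for the fabrics on your system.
    print(loadable_libraries(["libfabric.so.1", "libucp.so.0"]))
```

Running such a probe inside the container is one quick way to check whether 
the external driver mounts actually landed where they were expected.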
