Thanks Ralph, now I get what you had in mind.
Strictly speaking, you are assuming that Open MPI performance matches that of the system MPI. This is generally true for common interconnects and/or those with providers for libfabric or UCX, but not so for "exotic" interconnects (which might not be supported natively by Open MPI or its abstraction layers) and/or for uncommon topologies (for which Open MPI's collective communications are not fully optimized). In those cases, using the system/vendor MPI is the best option performance-wise.

Cheers,

Gilles

On Fri, Jan 28, 2022 at 2:23 AM Ralph Castain via users <users@lists.open-mpi.org> wrote:

> Just to complete this - there is always a lingering question regarding shared memory support. There are two ways to resolve that one:
>
> * run one container per physical node, launching multiple procs in each container. The procs can then utilize shared memory _inside_ the container. This is the cleanest solution (i.e., it minimizes container boundary violations), but some users need/want per-process isolation.
>
> * run one container per MPI process, having each container mount an _external_ common directory to an internal mount point. This allows each process to access the common shared memory location. As with the device drivers, you typically specify that external mount location when launching the container.
>
> Using those combined methods, you can certainly have a "generic" container that suffers no performance impact relative to bare metal. The problem has been that it takes a certain degree of "container savvy" to set this up and make it work - which is beyond what most users really want to learn. I'm sure the container community is working on ways to reduce that burden (I'm not really plugged into those efforts, but others on this list might be).
>
> Ralph
>
>
> > On Jan 27, 2022, at 7:39 AM, Ralph H Castain <r...@open-mpi.org> wrote:
> >
> >> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" use case, my bad for not making my assumption clear.
> >> If the container is built to run on a specific host, there are indeed other options to achieve near-native performance.
> >
> > Err...that isn't actually what I meant, nor what we did. You can, in fact, build a container that can "run everywhere" while still employing high-speed fabric support. What you do is:
> >
> > * configure OMPI with all the fabrics enabled (or at least all the ones you care about)
> >
> > * don't include the fabric drivers in your container. These can/will vary across deployments, especially those (like NVIDIA's) that involve kernel modules
> >
> > * set up your container to mount specified external device driver locations onto the locations where you configured OMPI to find them. Sadly, this does violate the container boundary - but nobody has come up with another solution, and at least the violation is confined to just the device drivers. Typically, you specify the external locations that are to be mounted using an envar or some other mechanism appropriate to your container, and then include the relevant information when launching the containers.
> >
> > When OMPI initializes, it will go through its normal procedure of attempting to load each fabric's drivers, selecting the transports whose drivers it can load. NOTE: beginning with OMPI v5, you'll need to explicitly tell OMPI to build without statically linking in the fabric plugins, or else this probably will fail.
> >
> > At least one vendor now distributes OMPI containers preconfigured with their fabric support based on this method. So using a "generic" container doesn't mean you lose performance - in fact, our tests showed zero impact on performance using this method.
> >
> > HTH
> > Ralph
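In case it helps anyone follow along, here is a minimal sketch (not a recipe from this thread) of what the per-process launch with external mounts could look like. It assumes an Apptainer/Singularity-style runtime with --bind options, an InfiniBand device path, and made-up values for the image name, application binary, rank count, and shared directory - substitute whatever your site and fabric actually use.

#!/usr/bin/env python3
"""Illustrative sketch only: launch one container per MPI rank while
bind-mounting the host's fabric device files and a common host directory
for the shared-memory backing files. The rank count, device path, image
name, application binary, and shared directory are all assumptions."""

import os
import subprocess

NP = 4                               # hypothetical number of ranks
IMAGE = "generic_ompi_app.sif"       # hypothetical "generic" container image
SHM_HOST_DIR = "/var/tmp/ompi-shm"   # external directory shared by all ranks on a node

# The common external directory has to exist on the host before launch.
os.makedirs(SHM_HOST_DIR, exist_ok=True)

cmd = [
    "mpirun", "-np", str(NP),
    "apptainer", "exec",
    # expose the host's fabric device drivers inside each container
    "--bind", "/dev/infiniband",
    # mount the common external directory at the container's shared-memory location
    "--bind", f"{SHM_HOST_DIR}:/dev/shm",
    IMAGE, "./my_mpi_app",           # hypothetical MPI application inside the image
]

subprocess.run(cmd, check=True)

With the one-container-per-node option the second bind is unnecessary, since all the ranks already share the container's own /dev/shm; the device-driver bind is what lets the "generic" image pick up whichever fabric the host actually provides.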