On 23/11/2023 14:14, Alastair McKinstry wrote:
On 23/11/2023 12:44, Drew Parsons wrote:
On 2023-11-23 12:13, Emilio Pozuelo Monfort wrote:
Hi,
On 23/11/2023 09:36, Alastair McKinstry wrote:
Hi,
OpenMPI has a new upstream release, 5.0.0. It is in experimental now; the
SOVERSION for the public libraries remains 40.X (a minor version increment),
and only the private libraries get an SOVERSION bump, so in theory this is not
an ABI transition. However, 5.0.0 drops support for 32-bit systems.
The default MPI implementation for each architecture is set in mpi-defaults;
this allows a per-arch MPI choice; in practice we currently use OpenMPI for
all archs. The other choice is MPICH.
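For context on how that indirection works: a package that should build against the per-arch default implementation depends on the metapackages built from mpi-defaults rather than on openmpi or mpich directly, roughly like this (a sketch of the usual convention, not any specific package's control file):

```
# debian/control (excerpt)
Build-Depends: mpi-default-dev,
               mpi-default-bin
```

The package then builds with the generic mpicc/mpirun wrappers, so changing the default implementation on an architecture only requires binNMUs of the dependent packages, not source changes.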
So the question becomes: do we switch MPI for just 32-bit archs, or all?
What is the release team's opinion on this transition?
Having the same implementation across the board makes things easier for
testing purposes and the like; however, I don't see that as a blocker to
having separate implementations.
True, in one sense it's simpler to have the same default MPI. But we've set
up our package infrastructure so that in principle it should not matter. One
architecture does not (or should not) depend on another, so it shouldn't break
packages just because we'd have different MPI implementations on different
architectures. On the contrary, "actively" using both implementations could
lead to more robust packages overall as MPI bugs get fixed against both
implementations.
What are your thoughts on it? Is there a strong reason why we should
stick with OpenMPI for 64bit releases? Or from a different POV, what
are the risks of changing the implementation? Introducing a different
set of bugs?
One point to consider is that upstream developers of several of our numerical
libraries have time and again suggested to us that we use mpich instead of
openmpi, even before this v5 shift. They perceive (rightly or wrongly) that
mpich is more robust, more reliable.
It would be useful to know whether that changes with v5, or whether their
complaints are historical and openmpi has already fixed the bugs that
concerned them. mpich has had its own share of bugs over the years. My memory
told me RMA support was an issue in openmpi, but when I checked my facts, it
was mpich that had to be fixed (https://github.com/pmodels/mpich/issues/6110)
Drew
My understanding is that MPICH has typically been the reference implementation:
higher quality but less performant, particularly across the range of network
fabrics. Certainly I've mostly seen OpenMPI rather than MPICH on various HPC
machines. People would use either OpenMPI or a vendor's MPI (which may be a
forked MPICH with added network hardware support).
I'd really like to hear from upstream users whether they are still encountering
OpenMPI issues.
Personally I favour splitting, using MPICH on 32-bit archs to flush out bugs,
and doing so early in the dev cycle (now) so there is time to change if necessary.
Thank you both for your comments.
I don't think as the Release Team we have a preference one way or the other.
We'll let you pick the approach that you consider better. Obviously the freeze
is still a long way off, so if something comes up it can be changed later.
Cheers,
Emilio