On 2018-12-07 19:49, Alastair McKinstry wrote:
On 07/12/2018 11:26, Drew Parsons wrote:
Hi Alastair, openmpi3 seems to have stabilised now: packages are passing
tests and libpsm2 is no longer injecting 15 sec delays.
Nice that the mpich 3.3 release is now finalised. Do we feel
confident proceeding with the switch of mpi-defaults from openmpi to
mpich?
Are there any known issues with the transition? One that catches my
eye is the build failures in scalapack. It has been tuned to pass
build-time tests with openmpi but fails many tests with mpich
(scalapack builds packages for both MPI implementations). I'm not sure
how concerned we should be about those build failures. Perhaps upstream
should be consulted on it. Are similar mpich failures expected in
other packages? Is there a simple way of setting up a buildd to do a
test run of the transition before making it official?
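For whatever it's worth, a quick way to confirm which implementation a
rebuilt binary actually links against is a tiny probe like the sketch
below (just a sketch; I'm assuming the Debian wrapper names
mpicc.openmpi and mpicc.mpich here). MPI_Get_library_version names the
implementation explicitly, so the same source compiled with each
wrapper makes it obvious which stack is in play.

/* mpi_which.c - print which MPI library a binary was built against.
 * Compile once with each wrapper, e.g. (assumed Debian names):
 *   mpicc.openmpi mpi_which.c -o which_openmpi
 *   mpicc.mpich   mpi_which.c -o which_mpich
 * and run each under mpiexec to compare. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len, major, minor;

    MPI_Init(&argc, &argv);
    MPI_Get_version(&major, &minor);          /* MPI standard version */
    MPI_Get_library_version(version, &len);   /* implementation string */
    printf("MPI %d.%d: %s\n", major, minor, version);
    MPI_Finalize();
    return 0;
}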
Drew
Hi Drew,
Looking into it further, I'm reluctant now to move to mpich as the
default for buster. One reason is the experience of the openmpi3
transition, which shook out many issues.
I suspect we could see the same with other package builds that, as you
point out, have been tuned to openmpi rather than mpich; the other
concern is mpich's feature support.
e.g. mpich integration with psm / pmix / slurm is weak (in Debian).
While it might not look important to be able to scale to 10k+ nodes on
Debian (none of the top500 machines run Debian), we're seeing an
increase in the container use case: building MPI apps within
Singularity containers running on our main machine. We don't run
Debian as the OS on the base supercomputer at work because we need
kernel support from $vendor, but the apps are built in Singularity
containers running Debian ... very large-scale jobs become increasingly
likely, and openmpi / pmix is needed for that. Testing mpich, I've yet
to get CH4 working reliably (it's needed for pmix), and the OFI / UCX
support is labeled 'experimental'.
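To illustrate the sort of testing I mean: even a trivial
placement-and-ring check like the sketch below, launched via mpiexec or
srun inside the container, exercises the PMIx launch path and whichever
netmod (OFI or UCX) the CH4 build selected.

/* ring_check.c - minimal launch/transport smoke test (a sketch only).
 * Each rank reports where it landed, then a token is passed around a
 * ring, which exercises point-to-point traffic over the selected netmod. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len, token = 0;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("rank %d/%d on %s\n", rank, size, host);

    if (size > 1) {
        if (rank == 0) {
            /* start the token, then wait for it to come back around */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("ring completed, token = %d (expected %d)\n",
                   token, size - 1);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            token++;
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0,
                     MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}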
My driving use case for the move to mpich had been fault tolerance,
needed for co-arrays (https://tracker.debian.org/pkg/open-coarrays),
which in turn are needed for Fortran 2018, but I've since re-done
open-coarrays to build both openmpi and mpich variants, so that issue
went away.
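(For anyone unfamiliar with the fault-tolerance point: Fortran 2018
failed-image support means the runtime has to survive and report a lost
image rather than abort, which at the MPI level starts with installing
a non-fatal error handler, roughly as in the sketch below; real
failed-image recovery needs more than this, e.g. ULFM-style extensions.)

/* ft_sketch.c - rough illustration only: tell MPI not to abort on error
 * and check return codes ourselves.  A coarray runtime needs this as a
 * starting point; actual failed-image recovery needs more on top. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, err = MPI_SUCCESS, buf = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* default is MPI_ERRORS_ARE_FATAL: any failure tears down the job */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (size > 1) {
        if (rank == 0)
            err = MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            err = MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                           MPI_STATUS_IGNORE);

        if (err != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "rank %d: communication failed: %s\n", rank, msg);
            /* a coarray runtime would mark the peer image as failed here
             * instead of letting the whole job die */
        }
    }

    MPI_Finalize();
    return 0;
}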
So I think more testing of mpich3 builds with CH4 / pmix / OFI support
is needed, but moving from openmpi to mpich at this stage is iffy.
Thanks Alastair, your analysis sounds sound. I'm happy to be patient
with the switch and wait until after buster, especially if pmix support
complicates the matter. That will make it all the more useful to set up
a test buildd to trial the transition.
I'll invite upstream authors who have been promoting mpich over openmpi
to chip in with their experience.
Drew