I'll ask my collaborators whether they've submitted a ticket; they have the accounts, built the code, ran the code, and observed the issues.

I believe the issue on MPICH was a qsort issue and not an Allreduce issue. Coupled with the fact that qsort appears to be called in ompi_comm_split (https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c), I wanted to raise the issue so it can be investigated whether users can naively blunder into worst-case computational complexity.
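To make the failure mode concrete, here is a minimal sketch of the textbook comm_split algorithm (hypothetical code, not the actual Open MPI implementation): every rank allgathers (color, key, rank) triples and sorts all P of them. With key == rank (the common case), the sort input is already sorted, which is the classic worst case for a quicksort with a naive pivot choice. Whether this bites in practice depends on the platform's qsort (some libcs use a mergesort or introsort internally).

#include <stdlib.h>
#include <mpi.h>

typedef struct { int color, key, rank; } triple_t;

/* order by color, then key, then original rank */
static int cmp_triple(const void *a, const void *b)
{
    const triple_t *x = a, *y = b;
    if (x->color != y->color) return (x->color > y->color) - (x->color < y->color);
    if (x->key   != y->key)   return (x->key   > y->key)   - (x->key   < y->key);
    return (x->rank > y->rank) - (x->rank < y->rank);
}

int naive_comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    triple_t mine = { color, key, rank };
    triple_t *all = malloc((size_t)size * sizeof *all);

    /* 3 ints per rank; assumes the struct has no padding, which holds
       for three ints on common ABIs */
    MPI_Allgather(&mine, 3, MPI_INT, all, 3, MPI_INT, comm);

    /* every rank sorts all P triples; with key == rank the array is
       already sorted -- the worst case for a naive quicksort: O(P^2)
       comparisons on each of the P ranks */
    qsort(all, (size_t)size, sizeof *all, cmp_triple);

    /* ...walk the sorted array, collect the ranks sharing my color,
       build a group, and create the communicator (omitted)... */

    free(all);
    (void)newcomm;
    return MPI_SUCCESS;
}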
We've been running hpgmg-fv (not -fe). They were using the flux variants (this requires a local.mk that builds operators.flux.c instead of operators.fv4.c), and they are a couple of commits behind; regardless, this issue has persisted on K for several years. By default, hpgmg builds log(N) subcommunicators, where N is the problem size, and weak-scaling experiments have shown comm_split/dup times growing consistent with the worst-case complexity. That being said, AMR codes might rebuild their subcommunicators as they regrid/adapt, so this is not just a one-time setup cost. A simplified sketch of the communicator-build pattern is below.
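This is roughly the pattern (a simplified sketch, not the actual hpgmg code): one MPI_Comm_split per coarsening level, with the surviving ranks halved each level, so about log2(P) splits, each timed individually so quadratic growth shows up directly.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm level_comm = MPI_COMM_WORLD;
    int active = size;  /* ranks still active on the current level */

    for (int level = 0; active > 1; level++) {
        /* keep the lower half of the active ranks, as a coarsening proxy */
        int color = (rank < (active + 1) / 2) ? 0 : MPI_UNDEFINED;

        double t0 = MPI_Wtime();
        MPI_Comm next;
        MPI_Comm_split(level_comm, color, rank, &next);
        double t1 = MPI_Wtime();

        if (rank == 0)  /* rank 0 survives every level */
            printf("level %d: %d ranks, comm_split took %.3f s\n",
                   level, active, t1 - t0);

        if (level_comm != MPI_COMM_WORLD) MPI_Comm_free(&level_comm);
        if (next == MPI_COMM_NULL) break;  /* this rank dropped out */
        level_comm = next;
        active = (active + 1) / 2;
    }

    if (level_comm != MPI_COMM_WORLD && level_comm != MPI_COMM_NULL)
        MPI_Comm_free(&level_comm);
    MPI_Finalize();
    return 0;
}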
> On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>
> Samuel,
>
> The default MPI library on the K computer is Fujitsu MPI, and yes, it
> is based on Open MPI.
> /* fwiw, an alternative is RIKEN MPI, and it is MPICH-based */
> From a support perspective, this should be reported to the HPCI
> helpdesk: http://www.hpci-office.jp/pages/e_support
>
> As far as I understand, the Fujitsu MPI currently available on K is not
> based on the latest Open MPI.
> I suspect most of the time is spent trying to find the new
> communicator ID (CID) when a communicator is created (vs. figuring out
> the new ranks).
> IIRC, on older versions of Open MPI, that was implemented with as many
> MPI_Allreduce(MPI_MAX) calls as needed to figure out the smallest common
> unused CID for the newly created communicator.
>
> So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> your program, only one MPI_Allreduce() should be involved per
> MPI_Comm_dup().
> But if you do the same thing in the middle of your run, after each
> rank has ended up with a different lowest unused CID, the performance
> can be (much) worse.
> If I understand your description of the issue correctly, that would
> explain the performance discrepancy between static vs. dynamic
> communicator creation time.
>
> fwiw, this part has been (highly) improved in the latest releases of Open MPI.
>
> If your benchmark is available for download, could you please post a link?
>
> Cheers,
>
> Gilles
>
> On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
>> Some of my collaborators have had issues with one of my benchmarks at high
>> concurrency (82K MPI procs) on the K machine in Japan. I believe K uses
>> Open MPI, and the issue has been tracked to time in MPI_Comm_dup/MPI_Comm_split
>> increasing quadratically with process concurrency. At 82K processes, each
>> call to dup/split is taking 15 s to complete. These high times restrict
>> comm_split/dup to being used statically (at the beginning) rather than
>> dynamically in an application.
>>
>> I had a similar issue a few years ago on ANL/Mira/MPICH, where qsort was
>> called to split the ranks. Although qsort/quicksort has an ideal
>> computational complexity of O(P log P) [P is the number of MPI ranks], it
>> can have a worst-case complexity of O(P^2)... at 82K ranks, P/log P is a
>> 5000x slowdown.
>>
>> Can you confirm whether qsort (or the like) is (still) used in these
>> routines in Open MPI? It seems mergesort (worst-case complexity of
>> O(P log P)) would be a more scalable approach. I have not observed this
>> issue with the Cray MPICH implementation, and the Mira MPICH issue has
>> since been resolved.
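P.S. To make sure I follow the CID search Gilles describes above, here is my rough sketch of that pattern (the helper cid_is_free_locally() is hypothetical, and this is not the actual Open MPI code):

#include <mpi.h>

/* hypothetical helper: returns 1 if this rank has never used 'cid' */
extern int cid_is_free_locally(int cid);

/* Agree on the smallest CID that is unused on every rank.  Each round
   costs two MPI_Allreduce calls; n back-to-back dups from a fresh state
   converge in one round each, but once the ranks' lowest free CIDs have
   diverged (dynamic create/free churn), many rounds may be needed. */
int allocate_cid(MPI_Comm comm)
{
    int candidate = 0;
    for (;;) {
        /* propose my lowest locally free CID at or above 'candidate' */
        while (!cid_is_free_locally(candidate)) candidate++;

        int agreed;
        MPI_Allreduce(&candidate, &agreed, 1, MPI_INT, MPI_MAX, comm);

        int free_here = cid_is_free_locally(agreed), free_everywhere;
        MPI_Allreduce(&free_here, &free_everywhere, 1, MPI_INT, MPI_LAND, comm);

        if (free_everywhere) return agreed;  /* every rank accepts it */
        candidate = agreed;  /* in use somewhere; keep searching upward */
    }
}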