Samuel,

The default MPI library on the K computer is Fujitsu MPI, and yes, it is based on Open MPI.
/* fwiw, an alternative is RIKEN MPI, and it is MPICH based */

From a support perspective, this should be reported to the HPCI helpdesk:
http://www.hpci-office.jp/pages/e_support
As far as I understand, the Fujitsu MPI currently available on K is not based on the latest Open MPI.

I suspect most of the time is spent trying to find the new communicator ID (CID) when a communicator
is created (vs. figuring out the new ranks). IIRC, on older versions of Open MPI, that was implemented
with as many MPI_Allreduce(MPI_MAX) operations as needed to figure out the smallest common unused CID
for the newly created communicator.

So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of your program, only one
MPI_Allreduce() should be involved per MPI_Comm_dup(). But if you do the same thing in the middle of
your run, after each rank has ended up with a different lowest unused CID, the performance can be
(much) worse (a sketch of that search loop is appended at the end of this message). If I understand
your description of the issue correctly, that would explain the performance discrepancy between static
vs. dynamic communicator creation.

fwiw, this part has been (greatly) improved in the latest releases of Open MPI.

If your benchmark is available for download, could you please post a link?

Cheers,

Gilles

On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
> Some of my collaborators have had issues with one of my benchmarks at high
> concurrency (82K MPI processes) on the K machine in Japan. I believe K uses
> Open MPI, and the issue has been tracked to the time spent in
> MPI_Comm_dup/MPI_Comm_split increasing quadratically with process concurrency.
> At 82K processes, each call to dup/split takes 15s to complete. These high
> times restrict comm_split/dup to being used statically (at the beginning)
> rather than dynamically in an application.
>
> I had a similar issue a few years ago on ANL/Mira/MPICH, where qsort was
> called to split the ranks. Although qsort/quicksort has an ideal computational
> complexity of O(P log P) [P is the number of MPI ranks], it can have a
> worst-case complexity of O(P^2)... at 82K ranks, that P/log P ratio is a
> ~5000x slowdown.
>
> Can you confirm whether qsort (or the like) is (still) used in these routines
> in Open MPI? It seems mergesort (worst-case complexity of O(P log P)) would be
> a more scalable approach. I have not observed this issue with the Cray MPICH
> implementation, and the Mira MPICH issue has since been resolved.
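[Editorial note appended for illustration] Below is a minimal, self-contained sketch of the kind of
CID search loop described above: each rank proposes its lowest locally free CID, an
MPI_Allreduce(MPI_MAX) picks the highest proposal, and the loop repeats until a round in which no rank
has to bump past the previous candidate. This is not the actual Open MPI implementation; the
cid_in_use table, MAX_CID, and the function names are invented for the example.

#include <mpi.h>
#include <stdbool.h>

#define MAX_CID 65536
static bool cid_in_use[MAX_CID];   /* per-process table of CIDs already allocated */

/* Lowest CID >= start that is free on this rank; returns MAX_CID when none is left. */
static int lowest_free_cid(int start)
{
    int cid = start;
    while (cid < MAX_CID && cid_in_use[cid])
        cid++;
    return cid;
}

/* Agree on the smallest CID that is unused on every rank of 'comm'.
 * Returns the agreed CID, or -1 if some rank has run out of CIDs. */
static int allocate_cid(MPI_Comm comm)
{
    int prev = -1;
    int proposal = lowest_free_cid(0);

    for (;;) {
        int candidate;
        /* One MPI_Allreduce(MPI_MAX) per round: everyone learns the highest
         * proposal.  Every CID below 'candidate' is taken on at least one rank. */
        MPI_Allreduce(&proposal, &candidate, 1, MPI_INT, MPI_MAX, comm);

        if (candidate >= MAX_CID)
            return -1;                    /* some rank has no free CID left */

        if (candidate == prev) {
            /* Nobody had to bump past the previous candidate, so it is free
             * on every rank: that is the smallest common unused CID. */
            cid_in_use[candidate] = true;
            return candidate;
        }

        prev = candidate;
        /* Re-propose: the lowest CID >= candidate that is free on this rank. */
        proposal = lowest_free_cid(candidate);
    }
}

When every rank has the same set of used CIDs (e.g., n MPI_Comm_dup calls right at the start of a
run), the loop converges after a couple of rounds; once the per-rank CID tables have diverged, each
additional round costs another full allreduce across all ranks, which is consistent with the static
vs. dynamic discrepancy reported above.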