Some of my collaborators have had issues with one of my benchmarks at high concurrency (82K MPI processes) on the K machine in Japan. I believe K uses Open MPI, and the issue has been tracked to the time spent in MPI_Comm_dup/MPI_Comm_split increasing quadratically with process concurrency. At 82K processes, each call to dup/split takes 15s to complete. Such high times restrict MPI_Comm_split/MPI_Comm_dup to static use (at the beginning of a run) rather than dynamic use within an application.
I had a similar issue a few years ago on ANL/Mira/MPICH, where the implementation called qsort to split the ranks. Although qsort/quicksort has an ideal computational complexity of O(P log P) [P is the number of MPI ranks], it can have a worst-case complexity of O(P^2). At 82K, the ratio between the two, P/log P (log base 2), is roughly a 5000x slowdown. Can you confirm whether qsort (or the like) is (still) used in these routines in Open MPI? It seems mergesort (worst-case complexity of O(P log P)) would be a more scalable approach. I have not observed this issue on the Cray MPICH implementation, and the Mira MPICH issue has since been resolved.
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel