Some of my collaborators have had issues with one of my benchmarks at high 
concurrency (82K MPI processes) on the K machine in Japan. I believe K uses 
Open MPI, and the issue has been traced to the time spent in 
MPI_Comm_dup/MPI_Comm_split, which grows quadratically with the number of 
processes. At 82K processes, each call to dup/split takes about 15 s to 
complete. Times this high limit comm_split/dup to static use (once, at startup) 
rather than dynamic use throughout an application.

I had a similar issue a few years ago on ANL/Mira/MPICH, where MPICH called 
qsort to split the ranks. Quicksort has an average-case complexity of 
O(P log P) [P is the number of MPI ranks], but its worst case is O(P^2). At 
82K ranks, the ratio P/log2(P) is about 5,000, so hitting the worst case can 
mean a roughly 5000x slowdown.

Can you confirm whether qsort (or something similar) is (still) used in these 
routines in Open MPI? Mergesort, whose worst-case complexity is O(P log P), 
would seem to be a more scalable approach. I have not observed this issue with 
the Cray MPICH implementation, and the Mira MPICH issue has since been resolved.


_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
