Samuel,

The default MPI library on the K computer is Fujitsu MPI, and yes, it
is based on Open MPI.
/* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
From a support perspective, this should be reported to the HPCI
helpdesk: http://www.hpci-office.jp/pages/e_support

As far as I understand, the Fujitsu MPI currently available on K is not
based on the latest Open MPI.
I suspect most of the time is spent trying to find the new
communicator ID (CID) when a communicator is created (rather than
figuring out the new ranks).
IIRC, on older versions of Open MPI, that was implemented with as many
MPI_Allreduce(MPI_MAX) rounds as needed to figure out the smallest
unused CID common to all ranks of the newly created communicator.
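
To illustrate the kind of algorithm I am referring to, here is a minimal
sketch of such a CID agreement loop.  This is not the actual Open MPI code,
and is_cid_free_locally() is a made-up helper standing in for the
per-process CID bookkeeping:

    #include <mpi.h>
    #include <stdbool.h>

    /* hypothetical helper: does this process consider the CID unused? */
    extern bool is_cid_free_locally(int cid);

    static int agree_on_cid(MPI_Comm comm)
    {
        int proposed = 0;

        for (;;) {
            int agreed, ok, all_ok;

            /* propose the smallest CID that is free on this rank */
            while (!is_cid_free_locally(proposed))
                proposed++;

            /* take the largest proposal across all ranks */
            MPI_Allreduce(&proposed, &agreed, 1, MPI_INT, MPI_MAX, comm);

            /* accept it only if it is free on every rank; otherwise
             * retry from that value, so several rounds may be needed */
            ok = is_cid_free_locally(agreed) ? 1 : 0;
            MPI_Allreduce(&ok, &all_ok, 1, MPI_INT, MPI_MIN, comm);
            if (all_ok)
                return agreed;
            proposed = agreed;
        }
    }

Each extra round costs two MPI_Allreduce() calls over the whole
communicator, which is why the number of rounds matters so much at 82K
processes.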

So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
your program, only one MPI_Allreduce() should be involved per
MPI_Comm_dup().
But if you do the same thing in the middle of your run, after each
rank has ended up with a different lowest unused CID, the performance
can be (much) worse.
If I understand your description of the issue correctly, that would
explain the performance discrepancy between static and dynamic
communicator creation time.
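
To make the static vs dynamic distinction concrete, here is a minimal
micro-benchmark sketch (the number of duplicates and the timing are
arbitrary, just for illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Comm dups[8];
        int rank;
        double t0, elapsed;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* "static" pattern: every rank duplicates MPI_COMM_WORLD the
         * same number of times right after MPI_Init, so all ranks
         * search for the next free CID from the same starting point
         * and each dup should need a single agreement round */
        t0 = MPI_Wtime();
        for (int i = 0; i < 8; i++)
            MPI_Comm_dup(MPI_COMM_WORLD, &dups[i]);
        elapsed = MPI_Wtime() - t0;

        if (rank == 0)
            printf("8 x MPI_Comm_dup: %.3f s\n", elapsed);

        /* if communicators are instead created and freed later in the
         * run, in a rank-dependent order, the lowest unused CID ends
         * up different on each rank and each dup/split may need
         * several agreement rounds: that is the slow "dynamic" case */

        for (int i = 0; i < 8; i++)
            MPI_Comm_free(&dups[i]);
        MPI_Finalize();
        return 0;
    }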

FWIW, this part has been greatly improved in the latest releases of Open MPI.

If your benchmark is available for download, could you please post a link?


Cheers,

Gilles

On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
> Some of my collaborators have had issues with one of my benchmarks at high
> concurrency (82K MPI procs) on the K machine in Japan.  I believe K uses
> Open MPI, and the issue has been tracked to time in MPI_Comm_dup/Comm_split
> increasing quadratically with process concurrency.  At 82K processes, each
> call to dup/split is taking 15s to complete.  These high times restrict
> comm_split/dup to being used statically (at the beginning) and not
> dynamically in an application.
>
> I had a similar issue a few years ago on ANL/Mira/MPICH where they called
> qsort to split the ranks.  Although qsort/quicksort has an ideal
> computational complexity of O(P log P) [P is the number of MPI ranks], it
> can have a worst-case complexity of O(P^2)... at 82K, P/log P is a ~5000x
> slowdown.
>
> Can you confirm whether qsort (or the like) is (still) used in these
> routines in Open MPI?  It seems mergesort (worst-case complexity of
> O(P log P)) would be a more scalable approach.  I have not observed this
> issue on the Cray MPICH implementation, and the Mira MPICH issue has since
> been resolved.
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
