I'll ask my collaborators if they've submitted a ticket.
(They have the accounts, built the code, ran the code, and observed the issues.)

I believe the issue on MPICH was a qsort issue and not an Allreduce issue.  Coupled
with the fact that qsort appears to be called in ompi_comm_split
(https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
I wanted to raise the issue so it can be investigated and we can understand
whether users can naively blunder into worst-case computational complexity.
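
For context, here is a minimal sketch of the kind of usage that exposes it
(illustrative only, not the actual benchmark; the half/half coloring and the
use of rank as the split key are my assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Split the world in half, keyed by rank.  With key = rank, the
     * gathered (color, key) pairs are already ordered, which is the
     * classic worst case for a quicksort with a naive pivot choice. */
    MPI_Comm half, dup;
    double t0 = MPI_Wtime();
    MPI_Comm_split(MPI_COMM_WORLD, (rank < size / 2) ? 0 : 1, rank, &half);
    double t_split = MPI_Wtime() - t0;

    t0 = MPI_Wtime();
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);
    double t_dup = MPI_Wtime() - t0;

    if (rank == 0)
        printf("P=%d  comm_split=%.3fs  comm_dup=%.3fs\n", size, t_split, t_dup);

    MPI_Comm_free(&half);
    MPI_Comm_free(&dup);
    MPI_Finalize();
    return 0;
}

Running something like this across a weak-scaling sweep is enough to see
whether split/dup time grows like P log P or like P^2.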

We've been running hpgmg-fv (not -fe).  They were using the flux variants
(which require a local.mk that builds operators.flux.c instead of operators.fv4.c),
and they are a couple of commits behind.  Regardless, this issue has persisted on K
for several years.  By default, the benchmark builds log(N) subcommunicators, where
N is the problem size.  Weak-scaling experiments have shown comm_split/dup times
growing consistently with the worst-case complexity.  That being said, AMR codes
might rebuild their subcommunicators every time they regrid/adapt.
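
For what it's worth, the per-level pattern looks roughly like the sketch below
(simplified, not the actual hpgmg source; halving the active ranks per level is
an assumed coarsening rule):

/* One MPI_Comm_split per multigrid level, i.e. O(log N) splits per (re)build. */
#include <mpi.h>

void build_level_comms(MPI_Comm world, int num_levels, MPI_Comm *level_comm)
{
    int rank, size;
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &size);

    int active = size;   /* ranks that still own boxes on this level */
    for (int lev = 0; lev < num_levels; lev++) {
        int color = (rank < active) ? 0 : MPI_UNDEFINED;  /* idle ranks get MPI_COMM_NULL */
        MPI_Comm_split(world, color, rank, &level_comm[lev]);
        if (active > 1)
            active /= 2; /* assumed: half as many active ranks on each coarser level */
    }
}

An AMR code that regrids would call something like this after every adapt step,
so the per-split cost is paid repeatedly rather than amortized once at startup.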

> On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
> Samuel,
> 
> The default MPI library on the K computer is Fujitsu MPI, and yes, it
> is based on Open MPI.
> /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
> From a support perspective, this should be reported to the HPCI
> helpdesk http://www.hpci-office.jp/pages/e_support
> 
> As far as I understand, the Fujitsu MPI currently available on K is not
> based on the latest Open MPI.
> I suspect most of the time is spent trying to find the new
> communicator ID (CID) when a communicator is created (vs. figuring out
> the new ranks).
> IIRC, on older versions of Open MPI, that was implemented with as many
> MPI_Allreduce(MPI_MAX) calls as needed to figure out the smallest common
> unused CID for the newly created communicator.
> 
> So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> your program, only one MPI_Allreduce() should be involved per
> MPI_Comm_dup().
> But if you do the same thing in the middle of your run, after each
> rank has ended up with a different lowest unused CID, the performance
> can be (much) worse.
> If I understand your description of the issue correctly, that would
> explain the performance discrepancy between static vs dynamic
> communicator creation time.
> 
> fwiw, this part has been (highly) improved in the latest releases of Open MPI.
> 
> If your benchmark is available for download, could you please post a link ?
> 
> 
> Cheers,
> 
> Gilles
> 
> On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
>> Some of my collaborators have had issues with one of my benchmarks at high
>> concurrency (82K MPI procs) on the K machine in Japan.  I believe K uses
>> Open MPI, and the issue has been tracked to time in MPI_Comm_dup/Comm_split
>> increasing quadratically with process concurrency.  At 82K processes, each
>> call to dup/split is taking 15s to complete.  These high times restrict
>> comm_split/dup to static use (at the beginning of a run) rather than dynamic
>> use within an application.
>> 
>> I had a similar issue a few years ago on ANL/Mira/MPICH where they called
>> qsort to split the ranks.  Although qsort/quicksort has an ideal computational
>> complexity of O(P log P) [P is the number of MPI ranks], it can have a
>> worst-case complexity of O(P^2)... at 82K, P/log P is roughly a 5000x slowdown.
>> 
>> Can you confirm whether qsort (or the like) is (still) used in these
>> routines in Open MPI?  It seems mergesort (with a worst-case complexity of
>> O(P log P)) would be a more scalable approach.  I have not observed this issue
>> on the Cray MPICH implementation, and the Mira MPICH issue has since been resolved.

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
