My guess would be that both aspects (sorting + CID allocation) could be
a problem.
A long time back there was an effort to convert the sequence of
allgather + qsort into a distributed sort (based on a paper by Moody
et al., which demonstrated the benefits of this approach).
Unfortunately, we didn't get to implement that part due to timing
constraints.
Also, we used to have a block allocation algorithm for CIDs that
saved quite a bit in communication cost, but we had to abandon it
because the implementation was wasting CIDs by not properly reusing
them. It is fairly clear how to implement it correctly; it just
hasn't been done for lack of resources.
Maybe we should revisit those items at one point in time.
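For what it's worth, a correct version would look roughly like the
sketch below: reserve a block of CIDs in one agreement step, hand them
out locally, and keep a free list so released CIDs are reused. All the
names are hypothetical; this is not the old code, just an illustration
of the shape of a correct implementation.

/* Hypothetical sketch of block-based CID allocation with reuse.
 * A contiguous block of CIDs is reserved in a single agreement step
 * and handed out locally; released CIDs go on a free list so the
 * space is not wasted. Not the old Open MPI code. */
#include <stdint.h>
#include <stdlib.h>

#define CID_BLOCK_SIZE 64

typedef struct cid_allocator {
    uint32_t  next;        /* next unused CID in the current block   */
    uint32_t  block_end;   /* one past the last CID in the block     */
    uint32_t *free_list;   /* CIDs returned by freed communicators   */
    size_t    free_count;
    size_t    free_cap;
} cid_allocator_t;

/* Hypothetical collective that reserves a fresh block of CIDs; the
 * point of block allocation is that this agreement happens once per
 * CID_BLOCK_SIZE communicators instead of once per communicator.   */
extern uint32_t reserve_cid_block(size_t block_size);

static uint32_t cid_alloc(cid_allocator_t *a)
{
    if (a->free_count > 0)              /* reuse released CIDs first */
        return a->free_list[--a->free_count];
    if (a->next == a->block_end) {      /* current block exhausted   */
        a->next      = reserve_cid_block(CID_BLOCK_SIZE);
        a->block_end = a->next + CID_BLOCK_SIZE;
    }
    return a->next++;
}

static void cid_release(cid_allocator_t *a, uint32_t cid)
{
    if (a->free_count == a->free_cap) {
        /* grow the free list; error handling omitted for brevity */
        a->free_cap  = a->free_cap ? 2 * a->free_cap : 16;
        a->free_list = realloc(a->free_list,
                               a->free_cap * sizeof(uint32_t));
    }
    a->free_list[a->free_count++] = cid;
}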
Thanks
Edgar
On 11/7/2017 10:23 AM, George Bosilca wrote:
Samuel,
You are right, we use qsort to sort the keys, but the qsort only
applies to participants with the same color. So the worst-case
complexity of the qsort can only be reached when most of the
processes participate with the same color.
What I think is OMPI's problem in this area is the selection of the
next CID for the newly created communicator. We do the selection of
the CID on the original communicator, and this accounts for a
significant increase in the duration, as we need to iterate longer
to converge to a common CID.
We haven't made any improvements in this area in the last few years;
we simply transformed the code to use non-blocking communications
instead of blocking ones, but this has little impact on the
performance of the split itself.
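For illustration, the CID selection boils down to something like the
sketch below (hypothetical code, not our actual implementation;
cid_is_free() stands in for the local table of used CIDs). This is the
iterated-allreduce agreement Gilles describes further down.

/* Hypothetical sketch of agreeing on a common CID with repeated
 * MPI_Allreduce calls over the parent communicator; not the actual
 * Open MPI implementation. */
#include <mpi.h>
#include <stdbool.h>

extern bool cid_is_free(int cid);            /* hypothetical lookup */

static int lowest_free_cid(int start)
{
    int cid = start;
    while (!cid_is_free(cid)) cid++;
    return cid;
}

int agree_on_cid(MPI_Comm parent)
{
    int proposed = lowest_free_cid(0);
    for (;;) {
        /* everyone adopts the largest proposal seen so far */
        int agreed;
        MPI_Allreduce(&proposed, &agreed, 1, MPI_INT, MPI_MAX, parent);

        /* accept only if the agreed CID is free on every rank */
        int ok_local = cid_is_free(agreed) ? 1 : 0;
        int ok_global;
        MPI_Allreduce(&ok_local, &ok_global, 1, MPI_INT, MPI_MIN, parent);
        if (ok_global) return agreed;

        /* otherwise keep searching above the rejected value */
        proposed = lowest_free_cid(agreed + 1);
    }
}

The important part is that both allreduces run over the parent
communicator, so every creation pays a cost proportional to the
parent's size, and the number of rounds grows once the ranks' sets of
used CIDs diverge.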
George.
On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
I'll ask my collaborators if they've submitted a ticket.
(they have the accounts; built the code; ran the code; observed
the issues)
I believe the issue on MPICH was a qsort issue and not an Allreduce
issue. Coupled with the fact that qsort appears to be called in
ompi_comm_split
(https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
I wanted to raise the issue so that it may be investigated to
understand whether users can naively blunder into worst-case
computational complexity.
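For reference, a comm_split-style implementation typically sorts
(color, key, rank) triples along these lines (hypothetical sketch, not
Open MPI's actual code); if the underlying qsort hits its worst case,
the whole split inherits the O(P^2) cost, while a guaranteed
O(P log P) sort such as mergesort would not.

/* Minimal sketch of the sort step in a comm_split-style
 * implementation: order participants by (color, key, parent rank).
 * Hypothetical code, not taken from Open MPI. */
#include <stdlib.h>

typedef struct {
    int color;   /* which new communicator the rank joins     */
    int key;     /* requested ordering within that color      */
    int rank;    /* rank in the parent, used as a tie-breaker */
} split_entry_t;

static int cmp_split_entry(const void *a, const void *b)
{
    const split_entry_t *x = a, *y = b;
    if (x->color != y->color)
        return (x->color > y->color) - (x->color < y->color);
    if (x->key != y->key)
        return (x->key > y->key) - (x->key < y->key);
    return (x->rank > y->rank) - (x->rank < y->rank);
}

/* qsort's worst case depends on the libc; swapping in a mergesort
 * here would guarantee O(P log P) regardless of the key pattern.  */
void sort_split_entries(split_entry_t *entries, size_t nprocs)
{
    qsort(entries, nprocs, sizeof(split_entry_t), cmp_split_entry);
}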
We've been running hpgmg-fv (not -fe). They were using the flux
variants (requires a local.mk that builds operators.flux.c instead of
operators.fv4.c) and they are a couple of commits behind. Regardless,
this issue has persisted on K for several years. By default, it will
build log(N) subcommunicators where N is the problem size. Weak
scaling experiments have shown comm_split/dup times growing
consistently with the worst-case complexity. That being said, AMR
codes might rebuild the subcommunicators as they regrid/adapt.
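For concreteness, the subcommunicator setup follows roughly this
pattern of one MPI_Comm_split per coarsening level (illustrative
sketch only, not hpgmg's actual code; the 8x coarsening per level is
an assumption).

/* Illustrative sketch of one MPI_Comm_split per coarsening level,
 * which is roughly how a multigrid-style code ends up with log(N)
 * subcommunicators. Not the actual hpgmg-fv code. */
#include <mpi.h>

void build_level_comms(MPI_Comm world, int nlevels, MPI_Comm *level_comm)
{
    int rank, size;
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &size);

    int active = size;                  /* ranks active on this level */
    for (int lvl = 0; lvl < nlevels; lvl++) {
        /* coarser levels keep fewer ranks; the rest drop out */
        int color = (rank < active) ? 0 : MPI_UNDEFINED;
        MPI_Comm_split(world, color, rank, &level_comm[lvl]);
        if (active > 1) active = (active + 7) / 8;
    }
}

If each split costs close to O(P^2) work, this setup pays that cost
log(N) times, and an AMR code that regrids pays it again on every
regrid.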
> On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet
> <gilles.gouaillar...@gmail.com> wrote:
>
> Samuel,
>
> The default MPI library on the K computer is Fujitsu MPI, and yes, it
> is based on Open MPI.
> /* fwiw, an alternative is RIKEN MPI, and it is MPICH based */
> From a support perspective, this should be reported to the HPCI
> helpdesk: http://www.hpci-office.jp/pages/e_support
>
> As far as I understand, the Fujitsu MPI currently available on K is not
> based on the latest Open MPI.
> I suspect most of the time is spent trying to find the new
> communicator ID (CID) when a communicator is created (vs. figuring out
> the new ranks).
> IIRC, on older versions of Open MPI, that was implemented with as many
> MPI_Allreduce(MPI_MAX) calls as needed to figure out the smallest common
> unused CID for the newly created communicator.
>
> So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
> your program, only one MPI_Allreduce() should be involved per
> MPI_Comm_dup().
> But if you do the same thing in the middle of your run, after each
> rank has a different lowest unused CID, the performance can be (much)
> worse.
> If I understand correctly your description of the issue, that would
> explain the performance discrepancy between static vs. dynamic
> communicator creation time.
>
> FWIW, this part has been (highly) improved in the latest releases of
> Open MPI.
>
> If your benchmark is available for download, could you please post a
> link?
>
>
> Cheers,
>
> Gilles
>
> On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams
> <swwilli...@lbl.gov> wrote:
>> Some of my collaborators have had issues with one of my benchmarks
>> at high concurrency (82K MPI procs) on the K machine in Japan. I
>> believe K uses Open MPI, and the issue has been tracked to time in
>> MPI_Comm_dup/Comm_split increasing quadratically with process
>> concurrency. At 82K processes, each call to dup/split takes 15s to
>> complete. These high times restrict comm_split/dup to being used
>> statically (at the beginning) and not dynamically in an application.
>>
>> I had a similar issue a few years ago on ANL/Mira/MPICH, where they
>> called qsort to split the ranks. Although qsort/quicksort has ideal
>> computational complexity of O(P log P) [P is the number of MPI
>> ranks], it can have worst-case complexity of O(P^2)... at 82K,
>> P/log P is about a 5000x slowdown.
>>
>> Can you confirm whether qsort (or the like) is (still) used in these
>> routines in Open MPI? It seems mergesort (worst-case complexity of
>> O(P log P)) would be a more scalable approach. I have not observed
>> this issue on the Cray MPICH implementation, and the Mira MPICH
>> issue has since been resolved.
>>
>>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel