Folks,

fwiw, I wrote a simple test program that mimics Open MPI's qsort usage,
and a given qsort invocation on 1 million keys always takes less than
one second on K.

So qsort() is unlikely to be to blame here.
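
For reference, something along these lines (a rough sketch, not the exact
program I ran; the struct and comparator just mimic sorting (color, key, rank)
triples the way a comm_split implementation might):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* illustration only: a (color, key, rank) triple similar in spirit to
 * what a comm_split implementation has to sort */
typedef struct {
    int color;
    int key;
    int rank;
} entry_t;

/* order by color, then key, then original rank as a tie-breaker */
static int cmp_entries(const void *a, const void *b)
{
    const entry_t *x = a;
    const entry_t *y = b;
    if (x->color != y->color) return (x->color > y->color) - (x->color < y->color);
    if (x->key   != y->key)   return (x->key   > y->key)   - (x->key   < y->key);
    return (x->rank > y->rank) - (x->rank < y->rank);
}

int main(void)
{
    const size_t n = 1000000;               /* one million keys */
    entry_t *v = malloc(n * sizeof(*v));
    if (NULL == v) return 1;

    for (size_t i = 0; i < n; i++) {        /* random colors and keys */
        v[i].color = rand() % 16;
        v[i].key   = rand();
        v[i].rank  = (int) i;
    }

    clock_t t0 = clock();
    qsort(v, n, sizeof(*v), cmp_entries);
    clock_t t1 = clock();

    printf("qsort of %zu entries: %.3f s\n",
           n, (double) (t1 - t0) / CLOCKS_PER_SEC);
    free(v);
    return 0;
}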

Cheers,

Gilles

On Wed, Nov 8, 2017 at 3:24 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> Samuel,
>
> We did some work in this area last year, but we never pursued it far enough to
> push the code back into OMPI. We would be interested in giving a better-defined
> context to a follow-up study, especially if there is a large-scale app as a
> target. Ping me back if you, or your collaborators, share the same interest.
>
>   George.
>
>
>
> On Tue, Nov 7, 2017 at 11:55 AM, Edgar Gabriel <egabr...@central.uh.edu>
> wrote:
>>
>> My guess would be that both aspects (sorting + CID allocation) could be a
>> problem.
>>
>> A long time back there was an effort to convert the sequence of allgather
>> + qsort into a distributed sort (based on a paper by Moody et al., where he
>> demonstrated the benefits of this approach).  Unfortunately, we didn't get to
>> implement that part due to time constraints.
>>
>> Also, we used to have a block allocation algorithm for CIDs that saved
>> quite a bit in communication costs, but we had to abandon it because the
>> implementation was wasting CIDs by not properly reusing them. It is fairly
>> clear how to implement it correctly; it just hasn't been done for lack
>> of resources.
>>
>> Maybe we should revisit those items at some point.
>> Thanks
>> Edgar
>>
>>
>> On 11/7/2017 10:23 AM, George Bosilca wrote:
>>
>> Samuel,
>>
>> You are right, we use qsort to sort the keys, but the qsort only applies
>> to participants with the same color. So the worst-case cost of the qsort
>> can only be reached when most of the processes participate with the same
>> color.
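>>
>> To illustrate (a rough sketch, not the actual comm.c code; all, mine,
>> nprocs and my_color are placeholders for whatever the real implementation
>> uses): after the allgather of the (color, key) pairs, each process keeps
>> only the entries that match its own color and qsorts just those, so the
>> size of the sort is the size of the new communicator, not of the parent.
>>
>> #include <stdlib.h>
>>
>> typedef struct { int color; int key; int rank; } split_info_t;
>>
>> /* order by key, then by original rank as a tie-breaker */
>> static int cmp_key_then_rank(const void *a, const void *b)
>> {
>>     const split_info_t *x = a, *y = b;
>>     if (x->key != y->key) return (x->key > y->key) - (x->key < y->key);
>>     return (x->rank > y->rank) - (x->rank < y->rank);
>> }
>>
>> /* keep the entries matching my_color, sort them, return how many */
>> static size_t sort_my_color(const split_info_t *all, size_t nprocs,
>>                             int my_color, split_info_t *mine)
>> {
>>     size_t m = 0;
>>     for (size_t i = 0; i < nprocs; i++) {
>>         if (all[i].color == my_color) mine[m++] = all[i];
>>     }
>>     qsort(mine, m, sizeof(*mine), cmp_key_then_rank);
>>     return m;
>> }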
>>
>> What I think is OMPI's problem in this area is the selection of the next CID
>> for the newly created communicator. We are doing the selection of the CID on
>> the original communicator, and this accounts for a significant increase in
>> the duration, as we need to iterate longer to converge to a common CID.
>>
>> We haven't made any improvements in this area for the last few years; we
>> simply transformed the code to use non-blocking communications instead of
>> blocking ones, but this has little impact on the performance of the split itself.
>>
>>   George.
>>
>>
>> On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams <swwilli...@lbl.gov>
>> wrote:
>>>
>>> I'll ask my collaborators if they've submitted a ticket.
>>> (they have the accounts; built the code; ran the code; observed the
>>> issues)
>>>
>>> I believe the issue on MPICH was a qsort issue and not an Allreduce issue.
>>> Coupled with the fact that qsort appears to be called in ompi_comm_split
>>> (https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c),
>>> I wanted to raise the issue so that it can be investigated to understand
>>> whether users can naively blunder into worst-case computational complexity
>>> issues.
>>>
>>> We've been running hpgmg-fv (not -fe).  They were using the flux variants
>>> (which require local.mk to build operators.flux.c instead of operators.fv4.c)
>>> and they are a couple of commits behind.  Regardless, this issue has persisted
>>> on K for several years.  By default, it builds log(N) subcommunicators, where
>>> N is the problem size.  Weak-scaling experiments have shown comm_split/dup
>>> times growing consistently with the worst-case complexity.  That being said,
>>> AMR codes might rebuild the subcommunicators as they regrid/adapt.
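>>>
>>> For context, the pattern is roughly the following (a sketch of the pattern,
>>> not hpgmg's actual code; nlevels and level_comm are placeholders): each
>>> coarser level keeps a shrinking subset of ranks, so a run ends up calling
>>> MPI_Comm_split on progressively smaller subsets of the world communicator.
>>>
>>> #include <mpi.h>
>>>
>>> /* build one subcommunicator per level, roughly halving the active ranks */
>>> static void build_level_comms(MPI_Comm world, int nlevels, MPI_Comm *level_comm)
>>> {
>>>     int rank, size, active;
>>>     MPI_Comm_rank(world, &rank);
>>>     MPI_Comm_size(world, &size);
>>>     active = size;
>>>     for (int l = 0; l < nlevels; l++) {
>>>         /* ranks outside the active set get MPI_COMM_NULL back */
>>>         int color = (rank < active) ? 0 : MPI_UNDEFINED;
>>>         MPI_Comm_split(world, color, rank, &level_comm[l]);
>>>         if (active > 1) active /= 2;
>>>     }
>>> }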
>>>
>>>
>>> > On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet
>>> > <gilles.gouaillar...@gmail.com> wrote:
>>> >
>>> > Samuel,
>>> >
>>> > The default MPI library on the K computer is Fujitsu MPI, and yes, it
>>> > is based on Open MPI.
>>> > /* fwiw, an alternative is RIKEN MPI, and it is MPICH-based */
>>> > From a support perspective, this should be reported to the HPCI
>>> > helpdesk http://www.hpci-office.jp/pages/e_support
>>> >
>>> > As far as I understand, the Fujitsu MPI currently available on K is not
>>> > based on the latest Open MPI.
>>> > I suspect most of the time is spent trying to find the new
>>> > communicator ID (CID) when a communicator is created (vs. figuring out
>>> > the new ranks).
>>> > IIRC, on older versions of Open MPI, that was implemented with as many
>>> > MPI_Allreduce(MPI_MAX) calls as needed to figure out the smallest common
>>> > unused CID for the newly created communicator.
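>>> >
>>> > Very roughly, the scheme looks like this (a sketch of what I describe
>>> > above, not the actual Open MPI code; comm is the parent communicator, and
>>> > smallest_free_cid_at_least() and cid_is_free() are placeholders for the
>>> > local CID table lookups):
>>> >
>>> > int cid = smallest_free_cid_at_least(0);
>>> > int agreed = 0;
>>> > while (!agreed) {
>>> >     int candidate, ok_local, ok_global;
>>> >     /* every rank proposes its lowest free CID; keep the largest one */
>>> >     MPI_Allreduce(&cid, &candidate, 1, MPI_INT, MPI_MAX, comm);
>>> >     /* check that the candidate is actually free on every rank */
>>> >     ok_local = cid_is_free(candidate);
>>> >     MPI_Allreduce(&ok_local, &ok_global, 1, MPI_INT, MPI_MIN, comm);
>>> >     if (ok_global) {
>>> >         cid = candidate;
>>> >         agreed = 1;
>>> >     } else {
>>> >         /* retry, starting above the rejected candidate */
>>> >         cid = smallest_free_cid_at_least(candidate + 1);
>>> >     }
>>> > }
>>> >
>>> > The more the lowest free CID differs across ranks, the more iterations
>>> > (and Allreduces) are needed.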
>>> >
>>> > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of
>>> > your program, only one MPI_Allreduce() should be involved per
>>> > MPI_Comm_dup().
>>> > But if you do the same thing in the middle of your run, after each
>>> > rank has ended up with a different lowest unused CID, the performance
>>> > can be (much) worse.
>>> > If I understand your description of the issue correctly, that would
>>> > explain the performance discrepancy between static vs. dynamic
>>> > communicator creation time.
>>> >
>>> > fwiw, this part has been (highly) improved in the latest releases of
>>> > Open MPI.
>>> >
>>> > If your benchmark is available for download, could you please post a
>>> > link?
>>> >
>>> >
>>> > Cheers,
>>> >
>>> > Gilles
>>> >
>>> > On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams <swwilli...@lbl.gov>
>>> > wrote:
>>> >> Some of my collaborators have had issues with one of my benchmarks at
>>> >> high concurrency (82K MPI processes) on the K machine in Japan.  I believe
>>> >> K uses Open MPI, and the issue has been tracked to the time in
>>> >> MPI_Comm_dup/Comm_split increasing quadratically with process concurrency.
>>> >> At 82K processes, each call to dup/split takes 15 s to complete.  These
>>> >> high times restrict comm_split/dup to being used statically (at the
>>> >> beginning) and not dynamically in an application.
>>> >>
>>> >> I had a similar issue a few years ago on ANL/Mira/MPICH, where they
>>> >> called qsort to split the ranks.  Although qsort/quicksort has an ideal
>>> >> computational complexity of O(P log P) [P is the number of MPI ranks], it
>>> >> can have a worst-case complexity of O(P^2)... at 82K processes, the ratio
>>> >> P/log P (roughly 82,000 / 16.3, using log base 2) is about a 5000x slowdown.
>>> >>
>>> >> Can you confirm whether qsort (or the like) is (still) used in these
>>> >> routines in Open MPI?  It seems mergesort (worst-case complexity of
>>> >> O(P log P)) would be a more scalable approach.  I have not observed this
>>> >> issue with the Cray MPICH implementation, and the Mira MPICH issue has
>>> >> since been resolved.
>>> >>
>>> >>
>>>
>>
>
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
