Folks,

fwiw, I wrote a simple test program that mimics Open MPI's qsort usage, and a given qsort invocation on 1 million keys always takes less than one second on K. So qsort() is unlikely to be blamed here.

Cheers,

Gilles
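[A minimal sketch of the kind of micro-benchmark Gilles describes; the actual program was not posted, so the shape below (timing a single qsort() call on one million random integer keys) is an assumption:]

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical reproduction of the test: sort 1M random keys with qsort()
 * and report the elapsed wall-clock time. */
static int cmp_int(const void *a, const void *b)
{
    int ia = *(const int *)a, ib = *(const int *)b;
    return (ia > ib) - (ia < ib);   /* avoids the overflow risk of ia - ib */
}

int main(void)
{
    const size_t n = 1000000;
    int *keys = malloc(n * sizeof(int));
    if (keys == NULL) return 1;
    for (size_t i = 0; i < n; i++)
        keys[i] = rand();

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    qsort(keys, n, sizeof(int), cmp_int);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("qsort of %zu keys: %.3f s\n", n, elapsed);
    free(keys);
    return 0;
}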
On Wed, Nov 8, 2017 at 3:24 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> Samuel,
>
> We did some work in this area last year, but we never pursued it to push the code back into OMPI. We would be interested in giving a more well-defined context to a follow-up study, especially if there is a large-scale app as a target. Ping me back if you, or your collaborators, share the same interest.
>
>   George.
>
> On Tue, Nov 7, 2017 at 11:55 AM, Edgar Gabriel <egabr...@central.uh.edu> wrote:
>>
>> My guess would be that both aspects (sorting + CID allocation) could be a problem.
>>
>> A long time back there was an effort to convert the sequence of allgather + qsort into a distributed sort (based on a paper by Moody et al., where he demonstrated the benefits of this approach). Unfortunately, we didn't get to implement this part due to time constraints.
>>
>> Also, we used to have a block allocation algorithm for CIDs that saved quite a bit in communication costs, but we had to abandon it because the implementation was wasting CIDs by not properly reusing them. It is fairly clear how to implement it correctly; it just hasn't been done for lack of resources.
>>
>> Maybe we should revisit those items at some point.
>> Thanks,
>> Edgar
>>
>> On 11/7/2017 10:23 AM, George Bosilca wrote:
>>
>> Samuel,
>>
>> You are right, we use qsort to sort the keys, but the qsort only applies to participants with the same color. So the qsort should only reach its worst-case complexity when most of the processes participate with the same color.
>>
>> What I think is OMPI's problem in this area is the selection of the next CID for the newly created communicator. We do the selection of the CID on the original communicator, and this accounts for a significant increase in the duration, as we will need to iterate longer to converge to a common CID.
>>
>> We haven't made any improvements in this area in the last few years; we simply transformed the code to use non-blocking instead of blocking communications, but this has little impact on the performance of the split itself.
>>
>>   George.
>>
>> On Tue, Nov 7, 2017 at 10:52 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
>>>
>>> I'll ask my collaborators if they've submitted a ticket. (They have the accounts; built the code; ran the code; observed the issues.)
>>>
>>> I believe the issue on MPICH was a qsort issue and not an Allreduce issue. When this is coupled with the fact that it looked like qsort is called in ompi_comm_split (https://github.com/open-mpi/ompi/blob/a7a30424cba6482c97f8f2f7febe53aaa180c91e/ompi/communicator/comm.c), I wanted to raise the issue so that it may be investigated, to understand whether users can naively blunder into worst-case computational complexity issues.
>>>
>>> We've been running hpgmg-fv (not -fe). They were using the flux variants (requires local.mk to build operators.flux.c instead of operators.fv4.c) and they are a couple of commits behind. Regardless, this issue has persisted on K for several years. By default, it will build log(N) subcommunicators, where N is the problem size. Weak-scaling experiments have shown comm_split/dup times growing consistently with the worst-case complexity. That being said, AMR codes might rebuild the subcommunicators as they regrid/adapt.
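[To illustrate the usage pattern Samuel describes, a hedged sketch of a code that builds roughly log2(P) subcommunicators by repeatedly halving the rank space, as a multigrid code might when agglomerating coarse levels. This is not hpgmg's actual code; the structure is an assumption for illustration only:]

#include <mpi.h>
#include <stdio.h>

#define MAX_LEVELS 64

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm levels[MAX_LEVELS];
    int nlevels = 0;

    /* Build one subcommunicator per halving of the rank space, roughly
     * log2(size) of them. Every MPI_Comm_split call pays the sorting and
     * CID-allocation costs discussed in this thread. */
    double t0 = MPI_Wtime();
    for (int group = size; group > 1 && nlevels < MAX_LEVELS; group /= 2) {
        int color = rank / group;   /* which block of 'group' ranks we fall in */
        int key   = rank % group;   /* preserve the relative rank order */
        MPI_Comm_split(MPI_COMM_WORLD, color, key, &levels[nlevels]);
        nlevels++;
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d calls to MPI_Comm_split took %.3f s\n", nlevels, t1 - t0);

    for (int i = 0; i < nlevels; i++)
        MPI_Comm_free(&levels[i]);
    MPI_Finalize();
    return 0;
}

[At 82K ranks this loop issues about 16 splits, so a 15 s per-call cost makes dynamic recreation, e.g. after AMR regridding, prohibitively expensive.]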
>>> On Nov 7, 2017, at 8:33 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
>>> >
>>> > Samuel,
>>> >
>>> > The default MPI library on the K computer is Fujitsu MPI, and yes, it is based on Open MPI.
>>> > /* fwiw, an alternative is RIKEN MPI, and it is MPICH-based */
>>> > From a support perspective, this should be reported to the HPCI helpdesk: http://www.hpci-office.jp/pages/e_support
>>> >
>>> > As far as I understand, the Fujitsu MPI currently available on K is not based on the latest Open MPI. I suspect most of the time is spent trying to find the new communicator ID (CID) when a communicator is created (vs. figuring out the new ranks). IIRC, on older versions of Open MPI, that was implemented with as many MPI_Allreduce(MPI_MAX) as needed to figure out the smallest common unused CID for the newly created communicator.
>>> >
>>> > So if you MPI_Comm_dup(MPI_COMM_WORLD) n times at the beginning of your program, only one MPI_Allreduce() should be involved per MPI_Comm_dup(). But if you do the same thing in the middle of your run, after each rank has ended up with a different lowest unused CID, the performance can be (much) worse. If I understand your description of the issue correctly, that would explain the performance discrepancy between static and dynamic communicator creation times.
>>> >
>>> > fwiw, this part has been (highly) improved in the latest releases of Open MPI.
>>> >
>>> > If your benchmark is available for download, could you please post a link?
>>> >
>>> > Cheers,
>>> >
>>> > Gilles
>>> >
>>> > On Wed, Nov 8, 2017 at 12:04 AM, Samuel Williams <swwilli...@lbl.gov> wrote:
>>> >> Some of my collaborators have had issues with one of my benchmarks at high concurrency (82K MPI procs) on the K machine in Japan. I believe K uses Open MPI, and the issue has been tracked to time in MPI_Comm_dup/MPI_Comm_split increasing quadratically with process concurrency. At 82K processes, each call to dup/split is taking 15s to complete. These high times restrict comm_split/dup to static use (at the beginning of a run) rather than dynamic use within an application.
>>> >>
>>> >> I had a similar issue a few years ago on ANL/Mira/MPICH, where they called qsort to split the ranks. Although qsort/quicksort has an ideal computational complexity of O(P log P) [P is the number of MPI ranks], it can have a worst-case complexity of O(P^2)... at 82K, P/log(P) is a 5000x slowdown.
>>> >>
>>> >> Can you confirm whether qsort (or the like) is (still) used in these routines in Open MPI? It seems mergesort (worst-case complexity of O(P log P)) would be a more scalable approach. I have not observed this issue with the Cray MPICH implementation, and the Mira MPICH issue has since been resolved.
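[To make the CID search Gilles describes concrete, a simplified sketch of the older agreement scheme: each rank proposes its lowest locally unused CID and the loop repeats until the maximum proposal is free on every rank. The helpers cid_is_free_locally() and lowest_free_cid_at_or_above() are hypothetical, and the real Open MPI code differs in detail:]

#include <mpi.h>
#include <stdbool.h>

/* Hypothetical helpers standing in for the local CID bookkeeping. */
extern bool cid_is_free_locally(int cid);
extern int  lowest_free_cid_at_or_above(int start);

/* Simplified model of the older CID-agreement scheme: iterate
 * MPI_Allreduce(MPI_MAX) over candidate CIDs until all ranks converge
 * on a CID that is free everywhere. Each extra iteration is a full
 * collective on the *original* communicator, which is why dynamic
 * communicator creation (ranks starting from different lowest unused
 * CIDs) can get much more expensive than static creation. */
int agree_on_cid(MPI_Comm comm)
{
    int proposal = lowest_free_cid_at_or_above(0);
    for (;;) {
        int winner;
        MPI_Allreduce(&proposal, &winner, 1, MPI_INT, MPI_MAX, comm);

        /* Check that the agreed candidate is free on every rank. */
        int ok_local = cid_is_free_locally(winner) ? 1 : 0;
        int ok_all;
        MPI_Allreduce(&ok_local, &ok_all, 1, MPI_INT, MPI_MIN, comm);
        if (ok_all)
            return winner;

        /* Someone already uses 'winner'; restart the search above it. */
        proposal = lowest_free_cid_at_or_above(winner + 1);
    }
}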
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel