George can speak more definitively about this.

In general, our "tuned" coll component (plugin) does exactly these kinds of determinations to figure out which algorithm to use at runtime. Not only are communicator process counts involved, but also size of message is considered. I count 5 different all2all algorithms in our tuned module (but George will have to speak about how each one is chosen). U. Tennessee published some papers on their method; they basically hard-coded minimized selection tables based on oodles of runs and empirical data.

If you'd like to look at the code, it's here in the tree:

    
https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/coll/tuned/coll_tuned_alltoall.c
    (v1.3 release branch)

    
https://svn.open-mpi.org/source/xref/ompi-trunk/ompi/mca/coll/tuned/coll_tuned_alltoall.c
    (development trunk)

I don't think there's much difference between the two, but they can drift since v1.3 is the release branch and the trunk is active development.


On Apr 12, 2009, at 3:20 PM, Tom Rosmond wrote:

I am curious about the algorithm(s) used in the OpenMPI implementations
of the all2all and all2allv.  As many of you know, there are alternate
algorithms for all2all type operations, such as that of Plimpton, et al (2006), that basically exchange latency costs for bandwidth costs, which
pays big dividends for large processor numbers, e.g. 100's or 1000's.
Does OpenMPI, or any other MPI distributions, test for processor count
and switch to such an all2all algorithm at some point?  I realize the
switchover point would be very much a function of the architecture, and
so could be a risky decision in some cases.  Nevertheless, has it been
considered?



--
Jeff Squyres
Cisco Systems

Reply via email to