Thank you Josh, that's interesting. I'll have a look.
--Oleg

On Feb 5, 2008 2:39 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:

> Oleg,
>
> Interesting work. You mentioned near the end of your email that you believe
> that adding support for piggybacking to the MPI standard would be the
> best solution. As you may know, the MPI Forum has reconvened and there
> is a working group for Fault Tolerance. This working group is
> discussing a piggybacking interface proposal for the standard, amongst
> other things. If you are interested in contributing to this
> conversation, you can find the mailing list here:
>  http://lists.cs.uiuc.edu/mailman/listinfo/mpi3-ft
>
> Best,
> Josh
>
> On Feb 5, 2008, at 4:58 AM, Oleg Morajko wrote:
>
> > Hi,
> >
> > I've been working on MPI piggyback techniques as part of my PhD work.
> >
> > Although MPI does not provide native support, there are several
> > different solutions for transmitting piggyback data with every MPI
> > communication; you may find a brief overview in papers [1, 2]. They
> > include copying the original message and the extra data into a bigger
> > buffer, sending an additional message, or changing the send datatype
> > to a dynamically created wrapper datatype that contains a pointer to
> > the original data plus the piggyback data. I have tried all of these
> > mechanisms and they work, but considering the overhead, there is no
> > single "best" technique that outperforms the others in all scenarios.
> > Jeff Squyres had interesting comments on this subject earlier on this
> > mailing list.
> >
> > Finally, after some benchmarking, I have implemented a hybrid
> > technique that combines the existing mechanisms. For small
> > point-to-point messages, datatype wrapping seems to be the least
> > intrusive, at least with Open MPI's implementation of derived
> > datatypes. For large point-to-point messages, experiments confirmed
> > that sending an additional message is much cheaper than wrapping (and
> > the relative intrusion is small anyway, since we are already sending
> > a large message). Moreover, the implementation may interleave the
> > original send with an asynchronous send of the piggyback data; this
> > optimization partially hides the latency of the additional send and
> > lowers the overall intrusion. The same criteria can be applied to
> > collective operations, except for barrier and reduce: as the former
> > transmits no data and the latter transforms the data, the only option
> > there is to send additional messages.
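> >
> > To make the point-to-point part concrete, here is a rough sketch of
> > what the wrapper could look like (PB_TAG, PB_THRESHOLD and the
> > single-int payload are made up for this example; the wrapping branch
> > and the receive-side logic are omitted):
> >
> > #include <mpi.h>
> >
> > #define PB_TAG        27001        /* made-up tag reserved for piggyback data */
> > #define PB_THRESHOLD  (16 * 1024)  /* made-up size cutoff in bytes (tunable)  */
> >
> > static int pb_payload;   /* one int of piggyback data, just for illustration */
> >
> > int MPI_Send(const void *buf, int count, MPI_Datatype dtype,
> >              int dest, int tag, MPI_Comm comm)
> > {
> >     int type_size;
> >     MPI_Type_size(dtype, &type_size);
> >
> >     if ((long)type_size * count <= PB_THRESHOLD) {
> >         /* Small message: wrap user data + piggyback into one derived
> >          * datatype built with absolute addresses (wrapping not shown). */
> >         return PMPI_Send(buf, count, dtype, dest, tag, comm);
> >     }
> >
> >     /* Large message: start the tiny piggyback send asynchronously so
> >      * it overlaps with the original (large) send, then complete both. */
> >     MPI_Request req;
> >     int rc;
> >     PMPI_Isend(&pb_payload, 1, MPI_INT, dest, PB_TAG, comm, &req);
> >     rc = PMPI_Send(buf, count, dtype, dest, tag, comm);
> >     PMPI_Wait(&req, MPI_STATUS_IGNORE);
> >     return rc;
> > }
> >
> > The receive-side wrapper has to make the same size-based decision and
> > post the matching receive on PB_TAG; getting that matching right is
> > the fiddly part.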
> >
> > There is a penalty, of course. Especially for collective operations
> > with very small messages, the intrusion may reach 15%, and that's a
> > lot. It then decreases to around 0.1% for bigger messages, but it is
> > still there. I don't know what your requirements/expectations are
> > regarding this issue. The only work that reported lower overheads is
> > [3], but they added native piggyback support by changing the
> > underlying MPI implementation.
> >
> > I think the best possible option is to add piggyback support to MPI
> > as part of the standard. A growing number of runtime tools use this
> > functionality for various reasons, and PMPI alone is certainly not
> > enough.
> > References of interest:
> >
> >   [1] Shende, S., Malony, A., Morris, A., Wolf, F., "Performance
> >   Profiling Overhead Compensation for MPI Programs", 12th EuroPVM/MPI
> >   Conference, LNCS, vol. 3666, pp. 359-367, 2005. They review various
> >   techniques and come up with datatype wrapping.
> >
> >   [2] Schulz, M., "Extracting Critical Path Graphs from MPI
> >   Applications", Cluster Computing 2005, IEEE International, pp. 1-10,
> >   September 2005. They use datatype wrapping.
> >
> >   [3] Vetter, J., "Dynamic Statistical Profiling of Communication
> >   Activity in Distributed Applications". They add piggyback support at
> >   the MPI implementation level and report very low overheads (no
> >   surprise).
> >
> > Regards,
> > Oleg Morajko
> >
> >
> > On Feb 1, 2008 5:08 PM, Aurélien Bouteiller <boute...@eecs.utk.edu>
> > wrote:
> >
> >> I don't know of any work in that direction at the moment. We do plan
> >> to eventually integrate at least causal message logging into the
> >> pml-v, which also involves piggybacking, so we are open to
> >> collaborating with you on this. Please let us know :)
> >>
> >> Aurelien
> >>
> >>
> >>
> >> On Feb 1, 2008, at 09:51, Thomas Ropars wrote:
> >>
> >>> Hi,
> >>>
> >>> I'm currently working on optimistic message logging and I would
> >>> like to implement an optimistic message logging protocol in Open
> >>> MPI. Optimistic message logging protocols piggyback information
> >>> about dependencies between processes onto application messages, so
> >>> that a consistent global state can be found after a failure. That's
> >>> why I'm interested in the problem of piggybacking information on
> >>> MPI messages.
> >>>
> >>> Is there any work on this problem at the moment?
> >>> Has anyone already implemented a mechanism in Open MPI to piggyback
> >>> data on MPI messages?
> >>>
> >>> Regards,
> >>>
> >>> Thomas
> >>>
> >>> Oleg Morajko wrote:
> >>>> Hi,
> >>>>
> >>>> I'm developing a causality-chain tracking library and need a
> >>>> mechanism to attach extra data to every MPI message, a so-called
> >>>> piggyback mechanism.
> >>>>
> >>>> As far as I know there are a few solutions to this problem, of
> >>>> which the two fundamental ones are the following:
> >>>>
> >>>>   * Dynamic datatype wrapping - if a user calls MPI_Send with,
> >>>>     let's say, 1024 doubles, the wrapped send implementation
> >>>>     dynamically creates a derived datatype: a structure composed of
> >>>>     a pointer to the 1024 doubles and the extra fields to be
> >>>>     piggybacked. The datatype is constructed with absolute
> >>>>     addresses to avoid copying the original buffer. The receiver
> >>>>     side creates the equivalent datatype to receive the original
> >>>>     data and the extra data. The performance of this solution
> >>>>     depends on how well derived datatypes are handled, but it seems
> >>>>     to be lightweight (a rough sketch follows this list).
> >>>>
> >>>>   * Sending the extra data in a separate message -- this seems to
> >>>>     have a much more significant overhead.
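> >>>>
> >>>> To illustrate the first approach, a rough sketch of the wrapping
> >>>> for a single send (the piggyback is one int here, error checking is
> >>>> left out, and send_wrapped is just a made-up helper name):
> >>>>
> >>>> #include <mpi.h>
> >>>>
> >>>> /* Send `count` elements of `dtype` from `buf`, carrying one extra int. */
> >>>> int send_wrapped(const void *buf, int count, MPI_Datatype dtype,
> >>>>                  int dest, int tag, MPI_Comm comm, int *piggyback)
> >>>> {
> >>>>     MPI_Datatype wrapped;
> >>>>     int          blocklens[2] = { count, 1 };
> >>>>     MPI_Datatype types[2]     = { dtype, MPI_INT };
> >>>>     MPI_Aint     displs[2];
> >>>>
> >>>>     /* Absolute addresses, so the original buffer is never copied. */
> >>>>     MPI_Get_address((void *)buf, &displs[0]);
> >>>>     MPI_Get_address(piggyback, &displs[1]);
> >>>>
> >>>>     MPI_Type_create_struct(2, blocklens, displs, types, &wrapped);
> >>>>     MPI_Type_commit(&wrapped);
> >>>>
> >>>>     /* With absolute displacements the buffer argument is MPI_BOTTOM;
> >>>>      * the receiver builds the symmetric datatype over its own buffers. */
> >>>>     int rc = PMPI_Send(MPI_BOTTOM, 1, wrapped, dest, tag, comm);
> >>>>
> >>>>     MPI_Type_free(&wrapped);
> >>>>     return rc;
> >>>> }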
> >>>>
> >>>> Do you know any other portable solution?
> >>>>
> >>>> I have implemented the first solution for P2P operations and it
> >>>> works pretty well. However, there are problems with collective
> >>>> operations. Two classes of collective calls are problematic:
> >>>>
> >>>>  1. Single-receiver calls, like MPI_Gather. The sender tasks in a
> >>>>     gather can be handled in the same way as a normal send: a data
> >>>>     item is wrapped and the extra data is piggybacked with the
> >>>>     message. The problem is on the receiver side, where the root
> >>>>     gathers N data items that must be received in an array big
> >>>>     enough to hold all items, strided by the datatype extent.
> >>>>
> >>>>     In particular, it seems impossible to construct a datatype that
> >>>>     contains a data item plus extra data (i.e. a structure type
> >>>>     with absolute addresses) AND then make an array of these
> >>>>     datatypes separated by a fixed extent. For example: the data
> >>>>     item to receive from every process is a vector of 1024 doubles,
> >>>>     and the extra data is a single integer. The user provides a
> >>>>     receive buffer with room for N * 1024 doubles, and the library
> >>>>     allocates an array of N integers to receive the piggybacked
> >>>>     data. How can one construct a datatype that can be used to
> >>>>     receive the data in MPI_Gather?
> >>>>
> >>>>  2. MPI_Reduce calls. There is no problem with datatypes here, as
> >>>>     the receiver gets a single data item and not an array as in the
> >>>>     previous case. The problem is the reduction operator itself
> >>>>     (MPI_Op), because these operators do not work with wrapped
> >>>>     datatypes. I can create a new operator that recognizes the
> >>>>     wrapped datatype, extracts the original data (skipping the
> >>>>     extra data) and performs the original reduction (a rough sketch
> >>>>     of such an operator follows this list). The open point is how
> >>>>     to invoke the original reduction on the existing datatype. I
> >>>>     have found that Open MPI internally calls ompi_op_reduce(op,
> >>>>     inbuf, rbuf, count, dtype), which solves the problem, but it
> >>>>     makes the code MPI-implementation dependent. Any ideas on more
> >>>>     portable options?
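> >>>>
> >>>> For what it's worth, a rough sketch of such an operator for the
> >>>> simplified case where the original reduction is known in advance
> >>>> (MPI_SUM over a fixed-size vector of doubles) and the reduce path
> >>>> packs the payload plus the piggyback int into a contiguous struct
> >>>> -- both simplifying assumptions; calling back into an arbitrary
> >>>> user-supplied MPI_Op is exactly the part with no portable hook:
> >>>>
> >>>> #include <mpi.h>
> >>>>
> >>>> typedef struct {
> >>>>     double data[1024];   /* original payload (fixed size for this sketch) */
> >>>>     int    piggyback;    /* extra data carried along                      */
> >>>> } wrapped_t;
> >>>>
> >>>> /* User-defined reduction: sum the payload, combine the piggyback. */
> >>>> static void wrapped_sum(void *invec, void *inoutvec, int *len,
> >>>>                         MPI_Datatype *dtype)
> >>>> {
> >>>>     wrapped_t *in    = (wrapped_t *) invec;
> >>>>     wrapped_t *inout = (wrapped_t *) inoutvec;
> >>>>     (void)dtype;  /* unused in this sketch */
> >>>>
> >>>>     for (int i = 0; i < *len; i++) {
> >>>>         for (int j = 0; j < 1024; j++)
> >>>>             inout[i].data[j] += in[i].data[j];    /* the original MPI_SUM  */
> >>>>         if (in[i].piggyback > inout[i].piggyback)
> >>>>             inout[i].piggyback = in[i].piggyback; /* e.g. max of piggyback */
> >>>>     }
> >>>> }
> >>>>
> >>>> /* Registration and use (wrapped_type is the committed struct
> >>>>  * datatype matching wrapped_t; its construction is not shown):
> >>>>  *   MPI_Op op;
> >>>>  *   MPI_Op_create(wrapped_sum, 1, &op);   // 1 = commutative
> >>>>  *   MPI_Reduce(&sendval, &recvval, 1, wrapped_type, op, 0, comm);
> >>>>  */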
> >>>>
> >>>>
> >>>> Thank you in advance for any comment.
> >>>>
> >>>> --Oleg
> >>>>
> >>>>
> >>>>
> >>
> >>
> >>
> >> --
> >> Dr. Aurélien Bouteiller
> >> Sr. Research Associate - Innovative Computing Laboratory
> >> Suite 350, 1122 Volunteer Boulevard
> >> Knoxville, TN 37996
> >> 865 974 6321
> >>
> >>
> >>
> >>
> >>
>
>
>
