Thank you, Jeff, for your advice. It was really helpful.

Concerning the reduce operation in the case of small messages: it is
possible to also wrap the reduction operator
and make it work with the wrapped data. Such an operator could reduce only the
original data and simply collect the piggybacked data (instead of reducing
it). And as you say, for big messages it could be more appropriate to send
separate messages asynchronously to the root.
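
Just to illustrate the idea, here is a minimal sketch of such a wrapped
operator (the packed element layout, all the names, the MPI_SUM semantics on
the original data, and the "keep the incoming value" collection policy are
only assumptions for this example):

#include <mpi.h>

/* Wrapped element: original value plus piggybacked extra value. */
typedef struct {
    double payload;    /* original user data      */
    int    piggyback;  /* piggybacked extra value */
} wrapped_t;

/* Reduce only the original payload; the piggybacked part is carried
   along (here: the incoming value is kept) instead of being reduced. */
static void wrapped_sum(void *invec, void *inoutvec, int *len,
                        MPI_Datatype *dtype)
{
    wrapped_t *in    = (wrapped_t *) invec;
    wrapped_t *inout = (wrapped_t *) inoutvec;
    for (int i = 0; i < *len; ++i) {
        inout[i].payload  += in[i].payload;
        inout[i].piggyback = in[i].piggyback;
    }
}

/* In the MPI_Reduce wrapper: */
void create_wrapped_op(MPI_Op *op)
{
    MPI_Op_create(wrapped_sum, 1 /* commutative */, op);
}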

Thanks again,
--Oleg

On 11/1/07, Jeff Squyres <jsquy...@cisco.com> wrote:
>
> On Oct 31, 2007, at 5:52 PM, Oleg Morajko wrote:
>
> > Let me clarify the context of the problem. I'm implementing an MPI
> > piggyback mechanism that should allow for attaching extra data to
> > any MPI message. The idea is to wrap MPI communication calls with
> > the PMPI interface (or with dynamic instrumentation or whatsoever) and
> > add/receive extra data in an inexpensive way. The best solution I
> > have found so far is dynamic datatype wrapping. That is, when a user
> > calls MPI_Send (datatype, count), I dynamically create a new
> > structure type that contains an array [count] of datatype and the extra
> > data. To avoid copying the original send buffer I use absolute
> > addresses to define displacements in the structure. This works fine
> > for all P2P calls and MPI_Bcast. And it definitely has performance
> > benefits when compared to copying buffers or sending an
> > additional message in a different communicator. Or would you expect
> > something different?
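> >
> > For illustration only, a rough sketch of how such a wrapped send could
> > look (the single extra int, the function name and the omitted error
> > checking are all assumptions of this example):
> >
> > #include <mpi.h>
> >
> > /* Hypothetical wrapper: send `count` elements of `dtype` from `buf`
> >    together with one piggybacked int, without copying the user buffer. */
> > int piggyback_send(void *buf, int count, MPI_Datatype dtype,
> >                    int dest, int tag, MPI_Comm comm, int extra)
> > {
> >     MPI_Datatype wrapped;
> >     int          blen[2]  = { count, 1 };
> >     MPI_Aint     disp[2];
> >     MPI_Datatype types[2] = { dtype, MPI_INT };
> >
> >     /* Absolute addresses of the user buffer and the extra data. */
> >     MPI_Get_address(buf, &disp[0]);
> >     MPI_Get_address(&extra, &disp[1]);
> >
> >     MPI_Type_create_struct(2, blen, disp, types, &wrapped);
> >     MPI_Type_commit(&wrapped);
> >
> >     /* With absolute displacements the send buffer is MPI_BOTTOM. */
> >     int rc = PMPI_Send(MPI_BOTTOM, 1, wrapped, dest, tag, comm);
> >
> >     MPI_Type_free(&wrapped);
> >     return rc;
> > }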
> >
> > The only problem is collective calls like MPI_Gather, where a root
> > process receives an array of data items. There is no problem
> > wrapping the message on the sender side (for each task), but the
> > question is how to define a datatype that points both to the original
> > receive buffer and to an extra buffer for piggybacked data AND has an
> > adequate extent to work as an array element.
> >
> > The real problem is that a structure datatype { original data,
> > extra data } does not have a constant displacement between the
> > original data and the extra data (e.g., consider original data = the
> > receive buffer in MPI_Gather and extra data = an array of ints somewhere
> > in memory). So it cannot be directly used as an array datatype.
>
> I guess I don't see why this is a problem...?  If you're already
> making a specific datatype for this communication, MPI's datatype
> primitives are flexible enough to allow what you describe.  Keep in
> mind that you can nest datatypes (e.g., with TYPE_CREATE_STRUCT).
>
> But for collectives, I think you need to decide exactly what
> information you want to generate / save.  Specifically, if you're
> piggybacking on collectives, you are stuck using the same
> communication pattern as that collective.  I.e., if the application
> calls MPI_REDUCE with MPI_SUM, I imagine you'll have a difficult time
> piggybacking your data on that reduction without it being summed
> across all the processes.
>
> There are a few other canonical solutions to the "need to save extra
> data about every communication" problem:
>
> - for small messages, do what you're doing: a) make a new/specific
> datatype for p2p messages or b) memcpy the user+extra data into a
> small contiguous buffer and then just send that (and memcpy out on
> the receiver).  If making datatypes is cheap in MPI, then a) is
> effectively the same as b), and potentially more optimized/tuned.
>
> - for large messages, don't bother making a new datatype -- just send
> around another message with your extra data.  The performance impact
> will be minimal because it's already a long message; don't force the
> MPI to do additional copies with a non-contiguous datatype if you can
> avoid it.
>
> - for collectives, if you can't piggyback (e.g., REDUCE with SUM and
> others), just send around another short message.  Yes, you'll take a
> performance hit for this.
>
> - depending on what data you're piggybacking / collecting, it may be
> possible to implement a "lazy" collection scheme in the meta/PMPI
> layer.  E.g., for when you send separate messages with your meta
> data, always use non-blocking sends.  The receiver PMPI layer can
> lazily collect this data and match it with application sends/receives
> after the fact (i.e., don't be trapped into thinking that you have to
> do the match exactly when the application data is actually sent or
> received -- it could be done after that).
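>
> As an illustration of that last idea only, here is a rough sketch (the
> private metadata communicator, the sequence-number payload and the
> pending-send list are all assumptions of this example, not part of any
> existing tool):
>
> #include <stdlib.h>
> #include <mpi.h>
>
> /* Tool-private communicator (e.g. PMPI_Comm_dup of MPI_COMM_WORLD,
>    done once in the MPI_Init wrapper). */
> static MPI_Comm meta_comm;
>
> /* One outstanding metadata send; kept so the buffer stays valid. */
> typedef struct pending {
>     int             meta;     /* e.g. a sequence number */
>     MPI_Request     req;
>     struct pending *next;
> } pending_t;
> static pending_t *pending_sends = NULL;
>
> /* PMPI wrapper: ship the metadata as a separate, non-blocking message. */
> int MPI_Send(const void *buf, int count, MPI_Datatype dtype,
>              int dest, int tag, MPI_Comm comm)
> {
>     static int seqno = 0;
>     pending_t *p = malloc(sizeof(*p));
>     p->meta = seqno++;
>     p->next = pending_sends;
>     pending_sends = p;
>
>     PMPI_Isend(&p->meta, 1, MPI_INT, dest, tag, meta_comm, &p->req);
>     return PMPI_Send(buf, count, dtype, dest, tag, comm);
> }
>
> /* PMPI wrapper: deliver the application message right away, then drain
>    whatever metadata has already arrived; matching it against application
>    receives can be done lazily, even at MPI_Finalize time. */
> int MPI_Recv(void *buf, int count, MPI_Datatype dtype, int src, int tag,
>              MPI_Comm comm, MPI_Status *status)
> {
>     int rc = PMPI_Recv(buf, count, dtype, src, tag, comm, status);
>
>     int flag, meta;
>     MPI_Status st;
>     PMPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, meta_comm, &flag, &st);
>     while (flag) {
>         PMPI_Recv(&meta, 1, MPI_INT, st.MPI_SOURCE, st.MPI_TAG,
>                   meta_comm, MPI_STATUS_IGNORE);
>         /* ... record (st.MPI_SOURCE, st.MPI_TAG, meta) for later matching ... */
>         PMPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, meta_comm, &flag, &st);
>     }
>     return rc;
> }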
>
> Hope that helps.
>
>
> > Any solution? It could be complex, I don't mind ;)
> >
> >
> > On 11/1/07, George Bosilca <bosi...@eecs.utk.edu> wrote:
> > The MPI standard defines the lower bound and the upper bound for
> > similar problems. However, even with all the functions in the MPI
> > standard we cannot describe all types of data. There is always a
> > solution, but sometimes one has to ask if the performance gain is
> > worth the complexity introduced.
> >
> >
> > As I said, there is always a solution. In fact there are two solutions,
> > one somewhat optimal, the other ... as bad as you can imagine.
> >
> > The bad approach:
> >   1. Use an MPI_Type_struct to create exactly what you want, element
> > by element (i.e. a single pair). This can work in all cases.
> >   2. If sizeof(int) == sizeof(double), then the displacement inside
> > each tuple (double_i, int_i) is constant. Therefore, you can start by
> > creating one "single element" type and then use for each send the
> > correct displacement in the array (added to the send buffer, or to
> > the receive one, respectively).
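> >
> > For illustration only (not part of the original mail), the second
> > variant could look roughly like this; the function name is invented,
> > one reusable pair type is built, and it relies on the
> > sizeof(int) == sizeof(double) condition mentioned above:
> >
> > #include <mpi.h>
> >
> > /* One reusable type describing { weights[i], values[i] } relative to
> >    &weights[i]; the relative displacement is the same for every i only
> >    when sizeof(int) == sizeof(double). */
> > MPI_Datatype make_pair_type(double *weights, int *values)
> > {
> >     MPI_Datatype pair;
> >     int          blen[2]  = { 1, 1 };
> >     MPI_Aint     disp[2], w0, v0;
> >     MPI_Datatype types[2] = { MPI_DOUBLE, MPI_INT };
> >
> >     MPI_Get_address(&weights[0], &w0);
> >     MPI_Get_address(&values[0],  &v0);
> >     disp[0] = 0;
> >     disp[1] = v0 - w0;
> >
> >     MPI_Type_create_struct(2, blen, disp, types, &pair);
> >     MPI_Type_commit(&pair);
> >     return pair;
> > }
> >
> > /* Usage: send the pair at index i from its natural location:
> >      MPI_Send(&weights[i], 1, pair, dest, tag, comm);          */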
> >
> >    george.
> >
> > On Oct 31, 2007, at 1:40 PM, Oleg Morajko wrote:
> >
> > > Hello,
> > >
> > > I have the following problem. There are two arrays somewhere in the
> > > program:
> > >
> > > double weights [MAX_SIZE];
> > > ...
> > > int       values [MAX_SIZE];
> > > ...
> > >
> > > I need to be able to send a single pair { weights[i], values[i] }
> > > with a single MPI_Send call, or receive it directly into both arrays
> > > at a given index i. How can I define a datatype that spans this
> > > pair over both arrays?
> > >
> > > The only additional constraint is the fact that the memory location
> > > of both arrays is fixed and cannot be changed, and I should avoid
> > > extra copies.
> > >
> > > Is it possible?
> > >
> > > Any help welcome,
> > > Oleg Morajko
> > >
> > >
>
>
> --
> Jeff Squyres
> Cisco Systems