Thanks Jeff, I now understand the different cases better and how to choose
depending on the situation.


2014-03-17 16:31 GMT+01:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:

> On Mar 16, 2014, at 10:24 PM, christophe petit <
> christophe.peti...@gmail.com> wrote:
>
> > I am studying optimization strategies for when the number of communication
> functions in a code is high.
> >
> > My MPI courses give two pieces of optimization advice which are
> contradictory:
> >
> > 1*) Use temporary message copies to allow non-blocking sends and to
> decouple sending and receiving.
>
> There are a lot of schools of thought here, and the real answer is going to
> depend on your application.
>
> If the message is "short" (and the exact definition of "short" depends on
> your platform -- it varies depending on your CPU, your memory, your
> CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce
> buffer is typically a good idea.  That lets you keep using your "real"
> buffer and not have to wait until communication is done.
>
> For "long" messages, the equation is a bit different.  If "long" isn't
> "enormous", you might be able to have N buffers available, and simply work
> on 1 of them at a time in your main application and use the others for
> ongoing non-blocking communication.  This is sometimes called "shadow"
> copies, or "ghost" copies.
>
> Such shadow copies are most useful when you receive something each
> iteration.  For example, something like this:
>
>   buffer[0] = malloc(...);
>   buffer[1] = malloc(...);
>   current = 0;
>   while (still_doing_iterations) {
>       MPI_Irecv(buffer[current], ..., &req);
>       /* work on buffer[1 - current] */
>       MPI_Wait(&req, MPI_STATUS_IGNORE);
>       current = 1 - current;
>   }
>
> You get the idea.
>
> > 2*) Avoid using temporary message copies because the copy adds extra
> execution time.
>
> It will, if the memcpy cost is significant (especially compared to the
> network time to send it).  If the memcpy is small/insignificant, then don't
> worry about it.
>
> You'll need to determine where this crossover point is, however.
>
> Also keep in mind that MPI and/or the underlying network stack will likely
> be doing these kinds of things under the covers for you.  Indeed, if you
> send short messages -- even via MPI_SEND -- it may return "immediately",
> indicating that MPI says it's safe for you to use the send buffer.  But
> that doesn't mean that the message has even actually left the current
> server and gone out onto the network yet (i.e., some other layer below you
> may have just done a memcpy because it was a short message, and the
> processing/sending of that message is still ongoing).
>
> > And then, we are advised to:
> >
> > - replace MPI_SEND with MPI_SSEND (synchronous blocking send): it is
> said that execution time is divided by a factor of 2
>
> This very, very much depends on your application.
>
> MPI_SSEND won't return until the receiver has started to receive the
> message.
>
> For some communication patterns, putting in this additional level of
> synchronization is helpful -- it keeps all MPI processes in tighter
> synchronization and you might experience less jitter, etc.  And therefore
> overall execution time is faster.
>
> But for others, it adds unnecessary delay.
>
> I'd say it's an over-generalization that simply replacing MPI_SEND with
> MPI_SSEND always cuts execution time in half.
>
> > - use MPI_ISSEND and MPI_IRECV with MPI_WAIT to synchronize
> (synchronous non-blocking send): it is said that execution time is
> divided by a factor of 3
>
> Again, it depends on the app.  Generally, non-blocking communication is
> better -- *if your app can effectively overlap communication and
> computation*.
>
> If your app doesn't take advantage of this overlap, then you won't see
> such performance benefits.  For example:
>
>    MPI_Isend(buffer, ..., &req);
>    MPI_Wait(&req, ...);
>
> Technically, the above uses ISEND and WAIT... but it's actually probably
> going to be *slower* than using MPI_SEND because you've made multiple
> function calls with no additional work between the two -- so the app didn't
> effectively overlap the communication with any local computation.  Hence:
> no performance benefit.
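>
> The overlapping version of that same pattern interleaves independent work
> between starting and completing the send (do_local_computation() is just a
> placeholder for work that doesn't touch buffer):
>
>    MPI_Isend(buffer, ..., &req);
>    do_local_computation();    /* overlaps with the ongoing send */
>    MPI_Wait(&req, MPI_STATUS_IGNORE);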
>
> > So what's the best optimization? Do we have to use a temporary message
> copy or not, and if so, in which cases?
>
> As you can probably see from my text above, the answer is: it depends.  :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
