Thanks Jeff, I now understand the different cases better and how to choose depending on the situation.
2014-03-17 16:31 GMT+01:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:

> On Mar 16, 2014, at 10:24 PM, christophe petit <
> christophe.peti...@gmail.com> wrote:
>
> > I am studying the optimization strategy when the number of communication
> > functions in a code is high.
> >
> > My courses on MPI say two things about optimization which are
> > contradictory:
> >
> > 1*) You have to use temporary message copies to allow non-blocking
> > sending and to decouple the sending and the receiving.
>
> There are a lot of schools of thought here, and the real answer is going to
> depend on your application.
>
> If the message is "short" (and the exact definition of "short" depends on
> your platform -- it varies with your CPU, your memory, your
> CPU/memory interconnect, ...etc.), then copying to a pre-allocated bounce
> buffer is typically a good idea. That lets you keep using your "real"
> buffer without having to wait until the communication is done.
>
> For "long" messages, the equation is a bit different. If "long" isn't
> "enormous", you might be able to have N buffers available, and simply work
> on one of them at a time in your main application while using the others
> for ongoing non-blocking communication. These are sometimes called "shadow"
> copies, or "ghost" copies.
>
> Such shadow copies are most useful when you receive something each
> iteration. For example, something like this:
>
>     buffer[0] = malloc(...);
>     buffer[1] = malloc(...);
>     current = 0;
>     while (still_doing_iterations) {
>         MPI_Irecv(buffer[current], ..., &req);
>         /* work on buffer[1 - current] */
>         MPI_Wait(&req, MPI_STATUS_IGNORE);
>         current = 1 - current;
>     }
>
> You get the idea.
>
> > 2*) Avoid using temporary message copies because the copy will add extra
> > cost to the execution time.
>
> It will, if the memcpy cost is significant (especially compared to the
> network time needed to send the message). If the memcpy is
> small/insignificant, then don't worry about it.
> You'll need to determine where this crossover point is, however.
>
> Also keep in mind that MPI and/or the underlying network stack will likely
> be doing these kinds of things under the covers for you. Indeed, if you
> send short messages -- even via MPI_SEND -- the call may return
> "immediately", indicating that MPI says it's safe for you to reuse the
> send buffer. But that doesn't mean the message has actually left the
> current server and gone out onto the network yet (i.e., some layer below
> you may just have done a memcpy because it was a short message, and the
> processing/sending of that message is still ongoing).
>
> > And then, we are advised to:
> >
> > - replace MPI_SEND with MPI_SSEND (synchronous blocking send): it is
> > said that execution time is divided by a factor of 2
>
> This very, very much depends on your application.
>
> MPI_SSEND won't return until the receiver has started to receive the
> message.
>
> For some communication patterns, putting in this additional level of
> synchronization is helpful -- it keeps all MPI processes in tighter
> synchronization and you might experience less jitter, etc. And therefore
> overall execution time is faster.
>
> But for others, it adds unnecessary delay.
>
> I'd say it's an over-generalization that simply replacing MPI_SEND with
> MPI_SSEND always reduces execution time by a factor of 2.
>
> > - use MPI_ISSEND and MPI_IRECV with the MPI_WAIT function to synchronize
> > (synchronous non-blocking send): it is said that execution time is
> > divided by a factor of 3
>
> Again, it depends on the app. Generally, non-blocking communication is
> better -- *if your app can effectively overlap communication and
> computation*.
>
> If your app doesn't take advantage of this overlap, then you won't see
> such performance benefits. For example:
>
>     MPI_Isend(buffer, ..., &req);
>     MPI_Wait(&req, ...);
>
> Technically, the above uses ISEND and WAIT...
> but it's actually probably
> going to be *slower* than using MPI_SEND, because you've made multiple
> function calls with no additional work between the two -- so the app
> didn't effectively overlap the communication with any local computation.
> Hence: no performance benefit.
>
> > So what's the best optimization? Do we have to use temporary message
> > copies or not, and if yes, in which cases?
>
> As you can probably see from my text above, the answer is: it depends. :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users