FWIW: a long time ago (read: many Open MPI / knem versions ago), I did a few 
benchmarks comparing knem and non-knem Open MPI installations.  IIRC, I used 
the usual suspects like NetPIPE, the NPBs, etc.  I saw a modest performance 
improvement (I don't remember the numbers offhand), but it was smaller than I 
had hoped for -- particularly in point-to-point message-passing latency (e.g., 
via NetPIPE).

Let me digress into a little background... 

The normal non-knem shared memory pattern is to copy a message from the source 
buffer in the source process to an area in shared memory.  The receiver then 
copies from the shared memory to the target buffer in its process.  For large 
messages, the transfer is pipelined so that the receiver doesn't have to wait 
for the whole buffer to be copied to shared memory before it starts copying out 
to the target buffer.  This is what's known as a "copy-in/copy-out" scheme -- 
you can think of it as 2 overlapping mem copies.
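
To make that concrete, here's a toy sketch of the scheme (my illustration, not 
Open MPI's actual sm BTL code -- the fragment size, slot count, and busy-wait 
flags are all invented for brevity):

#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHUNK  (64 * 1024)   /* pipeline fragment size (made up) */
#define NSLOTS 4             /* slots in the shared ring (made up) */

struct slot { _Atomic int full; size_t len; char data[CHUNK]; };

static void send_pipelined(struct slot *ring, const char *src, size_t n)
{
    for (size_t off = 0, i = 0; off < n; off += CHUNK, i = (i + 1) % NSLOTS) {
        struct slot *s = &ring[i];
        while (atomic_load(&s->full))         /* wait for a free slot */
            ;
        s->len = (n - off < CHUNK) ? n - off : CHUNK;
        memcpy(s->data, src + off, s->len);   /* copy #1: into shmem */
        atomic_store(&s->full, 1);            /* publish the fragment */
    }
}

static void recv_pipelined(struct slot *ring, char *dst, size_t n)
{
    for (size_t off = 0, i = 0; off < n; i = (i + 1) % NSLOTS) {
        struct slot *s = &ring[i];
        while (!atomic_load(&s->full))        /* wait for data */
            ;
        memcpy(dst + off, s->data, s->len);   /* copy #2: out of shmem */
        off += s->len;
        atomic_store(&s->full, 0);            /* hand the slot back */
    }
}

int main(void)
{
    static char src[1 << 20], dst[1 << 20];
    /* An anonymous shared mapping stands in for the shared-memory block;
       it is zero-filled, so every slot starts out empty. */
    struct slot *ring = mmap(NULL, NSLOTS * sizeof(struct slot),
                             PROT_READ | PROT_WRITE,
                             MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (fork() == 0) {                        /* child plays the receiver */
        recv_pipelined(ring, dst, sizeof dst);
        _exit(0);
    }
    send_pipelined(ring, src, sizeof src);    /* parent plays the sender */
    wait(NULL);
    return 0;
}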

The knem shared memory implementation still uses the shared memory block for 
short messages, coordination, and rendezvous.  But for large messages, the 
pipelined copy-in/copy-out is replaced with a direct copy from the source 
buffer in the source process to the target buffer in the receiver process (no 
pipelining is necessary, of course).  So there's only 1 mem copy for the bulk 
of the large message.
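
knem's real interface is a set of ioctl()s on /dev/knem -- the sender 
registers its buffer and passes a cookie to the receiver through the 
shared-memory rendezvous -- and I won't try to reproduce that from memory 
here.  But Linux's cross-memory attach syscalls (process_vm_readv/writev) 
perform the same kind of direct cross-process copy, so they show the 
single-copy model in just a few lines (a sketch; in this version the receiver 
does the pull -- see my disclaimer below):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Receiver-side pull: one copy, straight from the sender's buffer into
   ours, with no intermediate staging in shared memory.  The pid, address,
   and length would arrive from the peer via the usual rendezvous. */
static ssize_t pull_message(pid_t pid, void *remote_addr,
                            void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

int main(void)
{
    char src[] = "the bulk of a large message";
    char dst[sizeof src];
    /* Demo against our own pid; real code would use the peer's pid. */
    if (pull_message(getpid(), src, dst, sizeof src) == (ssize_t)sizeof src)
        printf("got: %s\n", dst);
    return 0;
}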

There's an obvious difference here: the knem version uses 1 mem copy for the 
bulk of a large message, and the non-knem version uses 2 mem copies.  So why 
wouldn't the knem version kick the non-knem version's butt?

I didn't dig deeply into it, but I rationalized that Open MPI's pipelined 
shared memory copies must be pretty good.  If you view this on a timeline, it 
might look like this (skipping lots of details about the initial rendezvous, 
etc.):

Non-knem / copy-in/copy-out scheme
==================================

Sender copying to shmem                                T=N
   |----------------------------------------------------|
        |----------------------------------------------------|
     Receiver copying from shmem                           T=N+x

You can see that the completion time is T=N+x, where x is small -- roughly the 
time for the receiver to drain the last pipeline fragment out of shared memory.

Knem scheme
===========

Sender copying to receiver                             T=N
   |----------------------------------------------------|

The completion time here is T=N -- not T=N+x.
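
To put some completely made-up numbers on that: if copying the bulk of a 1 MB 
message takes N = 200 microseconds and the pipeline fragment is 64 KB, then x 
is roughly the time to copy one fragment -- about 13 microseconds -- so 
copy-in/copy-out finishes only ~6% behind the direct copy.  Not much of a win 
for a single message.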

(disclaimer: it's been a loooong time since I've looked at the code; I don't 
remember if, in OMPI's knem scheme, the sender or the receiver does the copy).

From these timelines, you can see that if OMPI's pipelining is good, the 
overall performance win of an individual send/receive with knem vs. no-knem is 
not that huge.

Huh.  Disappointing.  :-(

BUT.

Then I expanded my benchmarking to scale up the number of MPI processes on each 
server.  *This* is where the real win is.

As you increase the number of MPI processes that are concurrently 
sending/receiving to/from each other, the "win" of knem becomes (much) more 
evident.  
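
The shape of that test is simple; here's a sketch (the message size, 
iteration count, and pairing scheme are illustrative -- it assumes an even 
number of ranks, all on one node):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* exercise the large-message path */
#define ITERS     100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pair rank i with rank i + size/2 so all pairs stream large
       messages concurrently and compete for memory bandwidth. */
    int half = size / 2;
    int peer = (rank < half) ? rank + half : rank - half;
    char *buf = malloc(MSG_BYTES);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank < half)
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%d pairs: aggregate %.0f MB/s\n", half,
               (double)half * ITERS * MSG_BYTES / 1e6 / (t1 - t0));
    free(buf);
    MPI_Finalize();
    return 0;
}

Run it with increasing -np, once with knem and once without (e.g., by 
toggling the btl_sm_use_knem MCA parameter, if memory serves), and watch the 
aggregate numbers diverge as -np grows.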

In short: doing 1 memory copy consumes half the memory bandwidth of doing 2.  
So when you have lots of MPI processes competing for memory bandwidth, it turns 
out that having each MPI process use half the bandwidth is a Really Good Idea.  
:-)  This allows more MPI processes to do shared memory communications before 
you hit the memory bandwidth bottleneck.
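
Made-up numbers again, just to show the arithmetic: say a socket has 20 GB/s 
of usable memory bandwidth and a single copy streams at 5 GB/s.  With 
copy-in/copy-out, each message stream is 2 concurrent copies, so about 2 
sender/receiver pairs saturate the socket; with single-copy, about 4 pairs 
run at full speed before hitting the same wall.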

Darius Buntinas, Brice Goglin, et al. wrote an excellent paper about exactly 
this set of issues; see http://runtime.bordeaux.inria.fr/knem/.  IIRC, it was 
the "Cache-Efficient, Intranode Large-Message MPI Communication with 
MPICH2-Nemesis" paper (but that was only after a quick glance at the titles 
this morning -- it might not be exactly that paper).


On Jul 12, 2013, at 5:07 AM, Mark Dixon <m.c.di...@leeds.ac.uk> wrote:

> Hi,
> 
> I'm taking a look at knem, to see if it improves the performance of any 
> applications on our QDR InfiniBand cluster, so I'm eager to hear about other 
> people's experiences. This doesn't appear to have been discussed on this list 
> before.
> 
> I appreciate that any effect that knem will have is entirely dependent on the 
> application, scale and input data, but:
> 
> * Does anyone know of any examples of popular software packages that benefit 
> particularly from the knem support in openmpi?
> 
> * Has anyone noticed any downsides to using knem?
> 
> Thanks,
> 
> Mark
> -- 
> -----------------------------------------------------------------
> Mark Dixon                       Email    : m.c.di...@leeds.ac.uk
> HPC/Grid Systems Support         Tel (int): 35429
> Information Systems Services     Tel (ext): +44(0)113 343 5429
> University of Leeds, LS2 9JT, UK
> -----------------------------------------------------------------


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

