Hi,
Thanks for the reply.

This all sounds logical. My current leading theory is also that it comes
down to the memcpy implementation, in conjunction with all the other
factors (architecture, CPU characteristics, copy size, etc.).

I tested on additional architectures, and there the pattern was reversed,
with the non-chunked memcpy doing better at large sizes. I also observed
the pattern change on some systems when going from point-to-point tests to
collectives.

So it looks like, as expected, there's probably no single concrete answer;
it depends on a combination of factors, and there may be no one best
setting for all scenarios. To satisfy my curiosity further, I might
experiment with different/custom memory copy implementations, and with
different compilers (e.g. icc).
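
For reference, the kind of chunking I mean is roughly the following -- a
simplified sketch of the general approach, not the actual
mca_smsc_xpmem_memmove() code:

    #include <stddef.h>
    #include <string.h>

    /* Copy 'size' bytes in pieces of at most 'chunk' bytes,
     * instead of a single big memcpy(). */
    static void chunked_copy(void *dst, const void *src,
                             size_t size, size_t chunk)
    {
        char *d = dst;
        const char *s = src;

        while (size > 0) {
            size_t n = (size < chunk) ? size : chunk;
            memcpy(d, s, n);
            d += n;
            s += n;
            size -= n;
        }
    }

Swapping the memcpy() call there for a custom loop (non-temporal stores,
rep movsb, etc.) should make it easy to compare implementations.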

George

PS. Yes, XPMEM lazily maps the pages at page fault, the first time the 
attachment is touched (during
the benchmark's warm-up run(s)). After this is done, it shouldn't be a factor.
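
A minimal way to force those faults up front, assuming buf/size describe
the attached region, is to touch one byte per page before the timed
iterations (a read access should be enough to trigger the mapping):

    #include <stddef.h>
    #include <unistd.h>

    /* Touch one byte per page so the attachment is faulted in
     * before timing. */
    static void touch_pages(const volatile char *buf, size_t size)
    {
        size_t page = (size_t) sysconf(_SC_PAGESIZE);
        for (size_t off = 0; off < size; off += page)
            (void) buf[off];
    }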

On Wed, 2022-08-03 at 13:30 +0000, Jeff Squyres (jsquyres) wrote:
> Sorry for the delay in replies -- it's summer / vacation season, and I think 
> we (as a community)
> are a little behind in answering some of these emails.  :-(
> 
> It's hard to say for any given machine, but a bunch of different hardware 
> factors can come into
> play, such as:
> 
> - L1, L2, L3 cache sizes
> - Cache contention
> - Memory controller connectivity and locality
> 
> I.e., exactly which hardware resources are the memcpy()'s in question using, 
> and how do they
> interact with each other?  How much overhead is produced, and/or how much 
> contention ensues when
> multiple requests are in flight simultaneously?  For example, it may be 
> counter-intuitive, but
> sometimes injecting a small amount of delay in a software pipeline can allow 
> hardware resources to
> not become overwhelmed, so the overall execution becomes more efficient and
> consumes less wall-clock time.  Hence, doing 2 x 1MB memcpy()'s (to effect a
> 2MB MPI_Send) may actually be overall more efficient, even though the individual
> parts of the
> transaction are less efficient.  This is a complete guess, and may have 
> nothing to do with your
> system, but it's one of many possibilities.
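> 
> As a toy illustration of that guess (not a rigorous benchmark -- single
> iteration, no pinning), one could time the two patterns directly:
> 
>     /* Toy comparison: one 2 MB memcpy vs. two 1 MB memcpy()s
>      * of the same data. */
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <time.h>
> 
>     #define MB (1024 * 1024)
> 
>     static double now(void)
>     {
>         struct timespec ts;
>         clock_gettime(CLOCK_MONOTONIC, &ts);
>         return ts.tv_sec + ts.tv_nsec / 1e9;
>     }
> 
>     int main(void)
>     {
>         char *src = malloc(2 * MB), *dst = malloc(2 * MB);
>         if (!src || !dst)
>             return 1;
>         memset(src, 1, 2 * MB);    /* fault the pages in first */
>         memset(dst, 0, 2 * MB);
> 
>         double t = now();
>         memcpy(dst, src, 2 * MB);              /* single large copy */
>         printf("1 x 2MB: %g s\n", now() - t);
> 
>         t = now();
>         memcpy(dst, src, MB);                  /* two half-size copies */
>         memcpy(dst + MB, src + MB, MB);
>         printf("2 x 1MB: %g s\n", now() - t);
> 
>         free(src);
>         free(dst);
>         return 0;
>     }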
> 
> Another possible factor: the specific memcpy() implementation is highly 
> relevant.  It's been a few
> years since I've paid close attention to memcpy(), but at one time, there was 
> significant
> variation in the quality of memcpy() implementations between different 
> compilers and/or versions
> of libc.  I don't know if this is still a factor, or whether memcpy() is 
> pretty well optimized in
> most situations these days.  Additionally, alignment can be an issue 
> (although for message sizes
> of 2MB, I'm guessing your buffer is page-aligned, and this probably isn't an 
> issue).
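> 
> A quick way to check the alignment part, for what it's worth:
> 
>     #include <stdint.h>
>     #include <unistd.h>
> 
>     /* Nonzero if p is aligned to the system page size. */
>     static int page_aligned(const void *p)
>     {
>         uintptr_t ps = (uintptr_t) sysconf(_SC_PAGESIZE);
>         return ((uintptr_t) p % ps) == 0;
>     }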
> 
> All that being said, I'm not intimately familiar with the internals of XPMEM, 
> so I don't know what
> userspace/kernel space mechanisms will come into play for mapping the shared 
> memory (e.g., is it
> lazily mapping the shared memory?).
> 
> Also, you're probably doing this already, but these kinds of things are worth 
> mentioning: make
> sure your performance benchmarks are testing the right things: do warmup 
> transfers, make sure
> you're not swapping, make sure all the processes and memory are pinned 
> properly, make sure you're
> on an otherwise-quiet machine, ... etc.  All the Usual Benchmarking Things.
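> 
> For Open MPI specifically, explicit binding plus a bindings report makes the
> pinning easy to verify, e.g.:
> 
>     mpirun --bind-to core --map-by core --report-bindings -np 2 ./my_benchmark
> 
> (./my_benchmark being a placeholder for whatever test you're running.)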
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> 
> ________________________________________
> From: devel <devel-boun...@lists.open-mpi.org> on behalf of Giorgos 
> Katevainis via devel
> <devel@lists.open-mpi.org>
> Sent: Thursday, July 28, 2022 9:33 AM
> To: Open MPI Developers
> Cc: Giorgos Katevainis
> Subject: [OMPI devel] Rationale behind memcpy chunk size (in smsc/xpmem)
> 
> Hello all,
> 
> I've come across the "memcpy_chunk_size" MCA parameter in smsc/xpmem, which 
> effectively causes
> memory copies to take place in chunks (used in mca_smsc_xpmem_memmove()). The 
> comment reads:
> 
> "Maximum size to copy with a single call to memcpy. On some systems a smaller 
> or larger number may
> provide better performance (default: 256k)"
> 
> And I have indeed observed a performance difference by adjusting it! E.g. in
> a simple point-to-point test, 2 MB messages do significantly better with the
> parameter set to 1 MB vs 2 MB. But... why? I could imagine a larger memcpy
> being more efficient, but what would cause many small ones to end up quicker
> than a single large one? Might it have something to do with memcpy
> intrinsics and different implementations for different sizes?
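> 
> For reference, the parameter can be adjusted on the command line; assuming
> the usual framework_component_param MCA naming, that would be:
> 
>     mpirun --mca smsc_xpmem_memcpy_chunk_size 1048576 -np 2 ./pingpong
> 
> (./pingpong stands in for my actual test; 1048576 bytes = 1 MB.)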
> 
> If someone knows what's going on under the hood and/or could direct me to
> any relevant resources, I would greatly appreciate it!
> 
> George
