Re: [OMPI devel] shared-memory allocations

Richard Graham wrote:
It does not make a difference who allocates it; what makes a difference is who touches it first.
Fair enough, but the process that allocates it also starts to initialize it right away.  So each circular buffer is set up (allocated and initialized/touched) by the sender.
> There is support for multiple circular buffers per FIFO.

The code is there, but I believe Gleb disabled using multiple FIFOs and added a list to hold pending messages, so now we are paying two overheads ...  I could be wrong here, but am pretty sure I am not.  I don't know whether George has touched the code since.
I think there is support for multiple CBs (circular buffers) per FIFO.  That is why there was the recent bug about sm hanging on unidirectional messaging after enough iterations: the sender would keep allocating room on the eager free list and on the outbound FIFO until the shared-memory area was filled.  Both the eager free list and the FIFO could grow "unbounded" -- that is, until the shared-memory area was exhausted.
The cost per process is linear in the total number of processes, so overall the cost scales as the square of the number of processes.  This was designed for small SMPs, to reduce coordination costs between processes, where memory costs are not large.  One can go to very simple schemes that are constant with respect to memory footprint, but then one pays the cost of multiple writers to a single queue -- this is what LA-MPI did.
The point was that there are these O(3n^2) allocations -- sometimes just 12 or 64 bytes apiece -- each taking up an entire page due to page alignment.  I understand we're choosing to have O(n^2) FIFOs.  I'm just saying that by aggregating these numerous tiny allocations, we can make them take up roughly 100x less space.
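To make the arithmetic concrete, here is a minimal sketch (not the actual sm BTL code; the 4 KiB page, the 16-process node, and the 64-byte control structure are assumptions for illustration) of packing the per-pair allocations into one page-aligned pool instead of giving each its own page:

#include <stdio.h>
#include <sys/mman.h>

#define PAGE_SIZE 4096

/* Trivial bump allocator over one page-aligned region; each object gets
 * only cacheline (64-byte) alignment instead of a whole page. */
static char  *pool;
static size_t pool_off, pool_len;

static void *pool_alloc(size_t sz)
{
    size_t aligned = (sz + 63) & ~(size_t)63;   /* round up to a cache line */
    if (pool_off + aligned > pool_len) return NULL;
    void *p = pool + pool_off;
    pool_off += aligned;
    return p;
}

int main(void)
{
    size_t n = 16;                    /* processes on the node (assumed) */
    size_t tiny = 64;                 /* bytes per control structure (assumed) */
    size_t nalloc = 3 * n * n;        /* ~3 tiny allocations per process pair */

    /* Current scheme: one page per allocation. */
    size_t paged = nalloc * PAGE_SIZE;

    /* Alternative: pack everything into one page-aligned shared pool. */
    pool_len = ((nalloc * tiny + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;
    pool = mmap(NULL, pool_len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    for (size_t i = 0; i < nalloc; i++)
        (void) pool_alloc(tiny);      /* hand out the per-pair structures */

    printf("one page each: %zu KiB, packed: %zu KiB (%.0fx smaller)\n",
           paged / 1024, pool_len / 1024, (double)paged / pool_len);
    return 0;
}

With these assumed numbers the packed pool is a few dozen pages instead of a few thousand; the exact factor depends on the real structure sizes.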

Patrick Geoffray wrote:
Richard Graham wrote:
Yes - it is polling volatile memory, so it has to load from memory on every read.
Actually, it will poll in cache, and only load from memory when the cache-coherency protocol invalidates the cache line.  The volatile semantics only prevent compiler optimizations.
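As a small illustration of that point, here is a hedged sketch (not the sm BTL code; cb_slot_t and wait_for_data are made-up names) of a reader spin-polling a flag in shared memory:

/* 'volatile' only stops the compiler from hoisting the load out of the
 * loop; the loads themselves are cache hits until the writer's store
 * invalidates that cache line. */
#include <stdint.h>

typedef struct {
    volatile uint64_t head;          /* advanced by the sender */
    char              payload[64];
} cb_slot_t;

static uint64_t wait_for_data(cb_slot_t *slot, uint64_t last_seen)
{
    uint64_t h;
    /* Each iteration re-reads slot->head, but the read only goes to memory
     * (or another cache) after the writer has modified that cache line. */
    while ((h = slot->head) == last_seen)
        ;                            /* busy-wait; real code would back off */
    return h;
}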

It does not matter much where the pages are (closer to the reader or the writer) on NUMA machines, as long as they are equally distributed among all the sockets (i.e., the choice is consistent).  Cache prefetching is slightly more efficient on the local socket, so closer to the reader may be a bit better.
Thanks for all the comments.  I think I follow all the reasoning, but what I was trying to figure out was whether the design was based solely on such reasoning or also on performance measurements.  Again, I tried some experiments.  I had two processes ping-pong via shared memory, and I moved the processes and the memory around -- local to the sender, local to the receiver, remote from both, etc.  I found that the ping-pong time depended only on the relative positions of the sender and the receiver.  It was unrelated to the position of the shared memory backing the shared variables.  E.g., if the sender and receiver were collocated, I got the best performance -- even if the shared memory was remote to both of them!  I don't know how general this result is, but it is at least one data point suggesting that the design may be based on reasoning that might be incomplete.
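For reference, the experiment was along these lines -- a minimal sketch, not the exact code used; the core numbers, the NUMA node number, and the iteration count are placeholders to vary by hand (build with: gcc -O2 pingpong.c -lnuma):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <numaif.h>                  /* mbind(), MPOL_BIND */

#define ITERS 100000

typedef struct {
    volatile long ping;              /* written by the parent */
    volatile long pong;              /* written by the child  */
} flags_t;

static void pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
    int sender_core = 0, receiver_core = 1, mem_node = 0;   /* vary these */

    flags_t *f = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    /* Bind the backing page to a specific NUMA node before first touch. */
    unsigned long nodemask = 1UL << mem_node;
    mbind(f, 4096, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);

    if (fork() == 0) {               /* child = receiver */
        pin_to_core(receiver_core);
        for (long i = 1; i <= ITERS; i++) {
            while (f->ping < i) ;    /* wait for ping */
            f->pong = i;             /* answer with pong */
        }
        _exit(0);
    }

    pin_to_core(sender_core);        /* parent = sender */
    for (long i = 1; i <= ITERS; i++) {
        f->ping = i;
        while (f->pong < i) ;
    }
    wait(NULL);
    puts("done");                    /* time the loop externally, e.g. with 'time' */
    return 0;
}

Varying sender_core, receiver_core, and mem_node across sockets while timing the run is what produced the result above: only the relative placement of the two cores seemed to matter.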

No big deal, but I just wanted to understand the motivation and rationale for what I see in the code.
