Richard Graham wrote:
> Fair enough, but the process that allocates it right away starts to initialize it. So each circular buffer is set up (allocated and initialized/touched) by the sender.
>
> There is support for multiple circular buffers per FIFO.

I think there is support for multiple CBs (circular buffers) per FIFO. That is why there was the recent bug about sm hanging on unidirectional messaging after so many iterations: the sender would keep allocating room for the eager free list and on the outbound FIFO until the shared-memory area was filled.

> Both the eager free list and the FIFO could grow "unbounded" (until the shared-memory area was exhausted). The cost per process is linear in the total number of processes, so overall the cost scales as the number of procs squared. This was designed for small SMPs, to reduce coordination costs between processes, and where memory costs are not large. One can go to very simple schemes that are constant with respect to memory footprint, but then pay the cost of multiple writers to a single queue -- this is what LA-MPI did.

The point was that there are these O(3n^2) allocations -- sometimes just 12 or 64 bytes apiece -- each taking up an entire page due to page alignment. I understand we're choosing to have O(n^2) FIFOs. I'm just saying that by aggregating these numerous tiny allocations, we can make them take up 100x less space.

Patrick Geoffray wrote:
> Richard Graham wrote:

Thanks for all the comments. I think I follow all the reasoning, but what I was trying to figure out was whether the design was based solely on such reasoning, or also on performance measurements. Again, I tried some experiments: I had two processes pingpong via shared memory, and I moved the processes and the memory around -- local to the sender, local to the receiver, remote from both, etc. I found the pingpong time depended only on the relative positions of the sender and the receiver. It was unrelated to the position of the shared memory backing the shared variables.
E.g., if the sender and receiver were collocated, I got the best performance -- even if the shared memory was remote to both of them! I don't know how general this result is, but it's at least one data point suggesting that the design may be based on incomplete reasoning. No big deal, but I just wanted to understand the motivation and rationale for what I see in the code.
- [OMPI devel] shared-memory allocations Eugene Loh