On 05/23/2012 03:05 PM, Jeff Squyres wrote:
On May 23, 2012, at 6:05 AM, Simone Pellegrini wrote:

If process A sends a message to process B and the eager protocol is used, then I 
assume that the message is written into a shared memory area and picked up by 
the receiver when the receive operation is posted.
Open MPI has a few different shared memory protocols.

For short messages, they always follow what you mention above: CICO (copy-in/copy-out).

For large messages, we either use a pipelined CICO (as you surmised below) or 
use direct memory mapping if you have the Linux knem kernel module installed.  
More below.
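
Roughly speaking, the choice looks something like the sketch below.  This is 
only an illustration, not Open MPI's actual code: EAGER_LIMIT, have_knem, and 
the proto_t names are made up here, and the real threshold is a run-time MCA 
parameter (e.g. btl_sm_eager_limit), not a compile-time constant.

    #include <stdbool.h>
    #include <stddef.h>

    /* Illustrative threshold; in Open MPI this is an MCA parameter. */
    #define EAGER_LIMIT 4096

    typedef enum {
        PROTO_EAGER_CICO,      /* short message: single copy-in/copy-out   */
        PROTO_PIPELINED_CICO,  /* large message: fragments through shmem   */
        PROTO_KNEM_COPY        /* large message: direct copy via knem      */
    } proto_t;

    /* Sketch of the per-message protocol choice described above. */
    static proto_t choose_protocol(size_t msg_len, bool have_knem)
    {
        if (msg_len <= EAGER_LIMIT)
            return PROTO_EAGER_CICO;
        return have_knem ? PROTO_KNEM_COPY : PROTO_PIPELINED_CICO;
    }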

When the rendezvous protocol is used, however, the message still needs to end up 
in the shared memory area somehow. I don't think any RDMA-like transfer exists 
for shared memory communications.
Just to clarify: RDMA = Remote Direct Memory Access, and the "remote" usually 
refers to a different physical address space (e.g., a different server).

In Open MPI's case, knem can use a direct memory copy between two processes.

Therefore you need to buffer this message somehow. However, I assume that you 
don't buffer the whole thing but use some kind of pipelined protocol, so that 
you reduce the size of the buffer you need to keep in shared memory.
Correct.  For large messages, when using CICO, we copy the first fragment and 
the necessary metadata to the shmem block.  When the receiver ACKs the first 
fragment, we pipeline the rest of the large message through the shmem block via 
CICO.  With the sender and receiver (more or less) simultaneously writing to and 
reading from the circular shmem block, we probably won't fill it up -- meaning 
that the sender hypothetically won't need to block.

I'm skipping a bunch of details, but that's the general idea.
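
If it helps, the data flow can be pictured with a toy single-process program: a 
large buffer moved through a small bounce buffer, one fragment at a time.  This 
is only a picture of the two copies involved in CICO -- in Open MPI the bounce 
buffer is a circular queue in a shared-memory segment and the sender and 
receiver run concurrently in separate processes; the sizes below are made up.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MSG_SIZE  (1 << 20)    /* 1 MiB "large" message (illustrative) */
    #define FRAG_SIZE (32 * 1024)  /* illustrative fragment size           */

    int main(void)
    {
        char *src = malloc(MSG_SIZE), *dst = malloc(MSG_SIZE);
        static char shm_frag[FRAG_SIZE];   /* stand-in for the shmem block */

        memset(src, 0x5a, MSG_SIZE);

        for (size_t off = 0; off < MSG_SIZE; off += FRAG_SIZE) {
            size_t len = MSG_SIZE - off < FRAG_SIZE ? MSG_SIZE - off
                                                    : FRAG_SIZE;
            memcpy(shm_frag, src + off, len);   /* sender:   copy-in  */
            memcpy(dst + off, shm_frag, len);   /* receiver: copy-out */
        }

        printf("payload intact: %s\n",
               memcmp(src, dst, MSG_SIZE) == 0 ? "yes" : "no");
        free(src);
        free(dst);
        return 0;
    }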

Is it completely wrong? It would be nice if someone could point me somewhere I 
can find more details about this. On the Open MPI tuning page there are several 
details regarding the protocol used for IB, but very little for SM.
Good point.  I'll see if we can get some more info up there.

I think I found the answer to my question on Jeff Squyres' blog:
http://blogs.cisco.com/performance/shared-memory-as-an-mpi-transport-part-2/

However, now I have a new question: how do I know if my machine uses the 
copy-in/copy-out mechanism or the direct mapping?
You need the Linux knem module.  See the OMPI README and do a text search for 
"knem".

Thanks a lot for the clarification.
However, I still have a hard time explaining the following phenomenon.

I have a very simple code that performs a ping/pong between 2 processes allocated on the same compute node. Each process is bound to a different CPU via affinity settings.
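
For reference, the code is essentially a standard ping-pong along the lines of 
the sketch below.  This is a minimal reconstruction for the sake of discussion, 
not my exact benchmark: the message size and iteration count are made up, the 
cache manipulation between iterations is omitted, and the CPU binding is done 
from outside (e.g. via mpirun's affinity options).

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_SIZE (1 << 20)   /* illustrative message size    */
    #define ITERS    100         /* illustrative iteration count */

    int main(int argc, char **argv)
    {
        int rank;
        char *buf = malloc(MSG_SIZE);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            /* (cache invalidation/preloading happens here, depending on
             *  the scenario -- omitted in this sketch) */
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        if (rank == 0)
            printf("avg round trip: %g us\n",
                   (MPI_Wtime() - t0) * 1e6 / ITERS);

        MPI_Finalize();
        free(buf);
        return 0;
    }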

I perform this operation with 3 cache scenarios (a sketch of how these states might be set up follows the list):
1) the cache is completely invalidated before the send/recv (on both the sender and the receiver side);
2) the cache is preloaded before the send/recv operation and the lines are in the "exclusive" state;
3) the cache is preloaded before the send/recv operation, but this time the cache lines are in the "modified" state.

Now scenario 2 has a speedup over scenario 1, as expected. However, scenario 3 is much slower than 1. I observed this for both knem and xpmem. I assume something is forcing the modified cache lines to be written back to memory before the copy is performed, probably because the segment is assigned to a volatile pointer, so somehow the data in the cache has to be written to main memory.

When the Open MPI CICO protocol is used, instead, scenarios 2 and 3 have exactly the same speedup over 1. Therefore I assume that in this case nothing forces the write-back of the dirty cache lines. I have been puzzling over this issue since yesterday, and it is quite difficult to understand without knowing all the internal details.

Is this behaviour expected for you as well, or do you find it surprising? :)

cheers, Simone


