Re: [OMPI users] openmpi shared memory feature

Jeff Squyres Fri, 2 Nov 2012 09:29:47 -0400

I don't know how to explain this any more than I have: Open MPI only uses those 
files for an initial shared memory rendezvous point (and they're not really 
"files", either).  After that, all communication is done through shared memory.


Open MPI 1.6.x actually offers several different types of back-end shared 
memory:

1. mmap (which is what is used by default, and what you are seeing): this 
technique creates a "file" in /tmp space, mmaps it into memory, and then has 
all the other processes do the same thing.  The file is then removed from the 
filesystem, leaving just the shared memory.

2. sysv: uses the shm* function calls to create shared memory.  See the 
shmctl(2), shmat(2), and shmget(2) man pages, for example.

3. posix: uses the shm_* function calls to create shared memory.  See the 
shm_open(2) man page, for example.
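
To make the mmap method (#1) concrete, here is a minimal C sketch of the same 
idea.  It is not Open MPI's actual code: the helper names (map_file, 
mmap_rendezvous_demo), the /tmp/shmem_demo_XXXXXX file name, the 4096-byte 
size, and the two pipes used to order the steps are all assumptions made up 
for illustration.  A parent and a child each map the same file, the parent 
removes the file, and only then does the child write a message that the 
parent reads back through memory.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SEG_SIZE 4096
static const char MSG[] = "hello from the child";

/* Hypothetical helper: map SEG_SIZE bytes of the file at `path`, shared. */
static char *map_file(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return NULL;
    void *p = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 0 if the child's message arrives after the file was unlinked. */
int mmap_rendezvous_demo(void)
{
    char path[] = "/tmp/shmem_demo_XXXXXX";   /* illustrative name only */
    int ready[2], go[2], status = 0;
    char token = 0;

    assert(sizeof MSG <= SEG_SIZE);
    int fd = mkstemp(path);                   /* create the rendezvous "file" */
    if (fd < 0) return -1;
    if (ftruncate(fd, SEG_SIZE) != 0 || pipe(ready) != 0 || pipe(go) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    close(fd);

    pid_t pid = fork();
    if (pid < 0) { unlink(path); return -1; }
    if (pid == 0) {
        /* Child: attach by path, report in, wait until the file is gone. */
        close(ready[0]);
        close(go[1]);
        char *seg = map_file(path);
        if (seg == NULL) _exit(1);
        if (write(ready[1], &token, 1) != 1) _exit(1);
        if (read(go[0], &token, 1) != 1) _exit(1);
        memcpy(seg, MSG, sizeof MSG);         /* plain memory write */
        _exit(0);
    }

    /* Parent: map the file, wait for the child, then remove the name. */
    close(ready[1]);
    close(go[0]);
    char *seg = map_file(path);
    int ok = seg != NULL
          && read(ready[0], &token, 1) == 1   /* child has mapped the file */
          && unlink(path) == 0                /* name removed, memory stays */
          && write(go[1], &token, 1) == 1;    /* let the child write */
    close(go[1]);
    close(ready[0]);
    waitpid(pid, &status, 0);
    unlink(path);                             /* harmless if already removed */
    if (!ok || !WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;

    int same = memcmp(seg, MSG, sizeof MSG) == 0;
    munmap(seg, SEG_SIZE);
    return same ? 0 : 1;
}
```

Calling mmap_rendezvous_demo() from a main() should return 0 on a typical 
Linux system.  The point of the sketch is only that the name in /tmp is 
needed while the processes find each other, and not afterwards.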

You can choose a different back-end shared memory technique with the "shmem" 
MCA parameter.  For example:

   # Use the mmap method
   mpirun --mca shmem mmap ...

   # Use the sysv method
   mpirun --mca shmem sysv ...

   # Use the posix method
   mpirun --mca shmem posix ...

See if using a different shared memory mechanism helps you out.
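
For comparison with the mmap method, here is a minimal C sketch of what the 
sysv method's underlying calls look like, using only shmget(2), shmat(2), 
shmdt(2), and shmctl(2).  Again, this is not Open MPI's code: the function 
name sysv_shm_demo, the segment size, and the use of fork() to hand the 
segment id to the second process are assumptions for illustration (unrelated 
processes would need to exchange the id some other way).

```c
#define _XOPEN_SOURCE 700
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SYSV_SEG_SIZE 4096
static const char SYSV_MSG[] = "hello via sysv";

/* Returns 0 if a message written by the child is visible to the parent. */
int sysv_shm_demo(void)
{
    int status = 0;

    /* Create a private segment; no file name is involved at any point. */
    int id = shmget(IPC_PRIVATE, SYSV_SEG_SIZE, IPC_CREAT | 0600);
    if (id < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) { shmctl(id, IPC_RMID, NULL); return -1; }
    if (pid == 0) {
        /* Child: attach by id, write the message, detach. */
        char *seg = shmat(id, NULL, 0);
        if (seg == (void *)-1) _exit(1);
        memcpy(seg, SYSV_MSG, sizeof SYSV_MSG);
        shmdt(seg);
        _exit(0);
    }

    /* Parent: wait for the child, then attach and read the message. */
    waitpid(pid, &status, 0);
    char *seg = shmat(id, NULL, 0);
    /* Mark for removal; the segment is destroyed after the last detach. */
    shmctl(id, IPC_RMID, NULL);
    if (seg == (void *)-1) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        shmdt(seg);
        return -1;
    }

    int same = memcmp(seg, SYSV_MSG, sizeof SYSV_MSG) == 0;
    shmdt(seg);
    return same ? 0 : 1;
}
```

One practical difference from the mmap sketch: a sysv segment that is never 
marked with IPC_RMID stays allocated after the processes exit, which is why 
the sketch marks it for removal as soon as the parent has attached.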

Finally, I'll +1 on what George said; you might want to go re-read his reply 
and answer his questions.

On Nov 1, 2012, at 5:36 AM, Mahmood Naderan wrote:

> I understand the advantages of the shared memory BTL. I wanted to 
> share some of my observations and gain an understanding of the internal 
> mechanisms of Open MPI. I am wondering why Open MPI uses a temporary file 
> for transferring data between two processes that are on the same node 
> (regardless of having a tmpfs or TCP stack). 
> 
> Assume there is no tmpfs. Then why should P1 and P2 on another node (B in 
> my example) communicate through TCP? Why should they use a file for 
> shared communication? We observe a lot of I/O activity (more writing 
> than reading). Basically, they should communicate through the RAM of the 
> node. An analogy is the boot process of node B, which has no disks. At 
> boot, it reads the images from the disk on A through the network. After 
> that, it has loaded everything it needs into *its RAM* and does whatever 
> it wants through its memory.
> 
> It seems that reading and writing files for this purpose is inefficient. 
> Wouldn't it be more logical to use an interprocess communication (IPC) API 
> to transfer the pointer to the data between processes? As an observation, 
> we found that MPICH2 does not use a temporary file for shared memory 
> management (though I have not figured out the mechanism yet) and achieves 
> better performance (minor but noticeable) than Open MPI.  
> 
> Any thoughts?
>  
> Regards,
> Mahmood
> 
> From: Jeff Squyres <jsquy...@cisco.com>
> To: Open MPI Users <us...@open-mpi.org> 
> Sent: Monday, October 29, 2012 4:31 PM
> Subject: Re: [OMPI users] openmpi shared memory feature
> 
> On Oct 29, 2012, at 11:01 AM, Ralph Castain wrote:
> 
> > Wow, that would make no sense at all. If P1 and P2 are on the same node, 
> > then we will use shared memory to do the transfer, as Jeff described. 
> > However, if you disable shared memory, as you indicated you were doing in a 
> > previous message (by adding -mca btl ^sm), then we would use a loopback 
> > device if available - i.e., the packet would be handed to the network 
> > stack, which would then return it to P2 without it ever leaving the node.
> > 
> > If there is no loopback device, and you disable shared memory, then we 
> > would abort the job with an error as there is no way for P1 to communicate 
> > with P2.
> > 
> > We would never do what you describe.
> 
> To be clear: it would probably be a good idea to have *some* tmpfs on your 
> diskless node.  Some things should simply not be on a network filesystem 
> (e.g., /tmp).  Google around; there are good reasons for having a small 
> tmpfs, even on a diskless server.
> 
> Indeed, Open MPI will warn you if it ends up putting a shared memory "file" 
> (which, as I described, isn't really a file) on a network filesystem -- e.g., 
> if /tmp is a network filesystem.  OMPI warns because corner cases can arise 
> that cause performance degradation (e.g., the OS may periodically write out 
> the contents of shared memory to the network filesystem).
> 
> But as Ralph says: Open MPI primarily uses shared memory when communicating 
> with processes on the same server (unless you disable shared memory).  This 
> means Open MPI copies message A from P1's address space to shared memory, and 
> then P2 copies message A from shared memory to its address space.  Or, if 
> you're using the Linux knem kernel module, MPI copies message A from P1's 
> address space directly to P2's address space.  No network transfer occurs, 
> unless you possibly have /tmp on a network filesystem, and/or no /dev/shm 
> filesystem, or other corner cases like that.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
