Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Sylvain Jeaugey

On Wed, 9 Jun 2010, Jeff Squyres wrote:


On Jun 9, 2010, at 3:26 PM, Samuel K. Gutierrez wrote:


System V shared memory cleanup is a concern only if a process dies in
between shmat and shmctl IPC_RMID.  Shared memory segment cleanup
should happen automagically in most cases, including abnormal process
termination.
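
A minimal sketch of the pattern being described (illustrative only, not the
actual Open MPI code): the segment is marked for removal immediately after
attach, so the kernel destroys it once the last process detaches or exits,
even abnormally; only a crash in the short window before shmctl(IPC_RMID)
leaks the segment.

    #include <stdio.h>
    #include <string.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        /* Create a private 4 MiB segment (no user-chosen key to collide). */
        int id = shmget(IPC_PRIVATE, 4 << 20, IPC_CREAT | 0600);
        if (id < 0) { perror("shmget"); return 1; }

        /* Attach it.  If the process dies here, the segment is orphaned. */
        void *base = shmat(id, NULL, 0);
        if (base == (void *)-1) { perror("shmat"); return 1; }

        /* Mark for removal now: on Linux the segment stays usable while
         * attached and is destroyed automatically on the last detach. */
        if (shmctl(id, IPC_RMID, NULL) != 0) { perror("shmctl"); return 1; }

        memset(base, 0, 4 << 20);   /* use the memory as usual */
        shmdt(base);                /* last detach frees the segment */
        return 0;
    }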


Umm... right.  Duh.  I knew that.

Really.

So -- we're good!

Let's open the discussion of making sysv the default on systems that support 
the IPC_RMID behavior (which, AFAIK, is only Linux)...

I'm sorry, but I think System V has many disadvantages over mmap.

1. As discussed before, cleanup is not as easy as for a file. It is a 
good thing to remove the shm segment after creation, but since problems 
often happen during shmget/shmat, there is still a high risk of leaving 
things behind.


2. There are kernel limits you may need to raise (kernel.shmall, 
kernel.shmmax). On most Linux distributions, shmmax is 32MB, which does 
not permit the sysv mechanism to work. Mmapped files have no such limit.
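
For reference, those are the standard sysctl knobs; a hedged example of
inspecting and raising them (the values shown are arbitrary and would
normally be chosen by a system administrator):

    # inspect the current limits
    sysctl kernel.shmmax kernel.shmall
    # raise them for the running system (example values: 1 GiB maximum
    # segment size; shmall is counted in pages)
    sysctl -w kernel.shmmax=1073741824
    sysctl -w kernel.shmall=262144
    # or persist the settings in /etc/sysctl.conf and run "sysctl -p"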


3. Each shm segment is identified by a 32-bit integer. This namespace is 
small (and non-intuitive, as opposed to a file name), and the probability 
of a collision is not zero, especially when you start creating multiple 
shared memory segments (for collectives, one-sided operations, ...).


So, I'm a bit reluctant to work with System V mechanisms again. I don't 
think there is a *real* reason for System V to be faster than mmap, since 
it should just be memory. I'd rather find out why mmap is slower.


Sylvain


Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Paul H. Hargrove



Sylvain Jeaugey wrote:

On Wed, 9 Jun 2010, Jeff Squyres wrote:


On Jun 9, 2010, at 3:26 PM, Samuel K. Gutierrez wrote:


System V shared memory cleanup is a concern only if a process dies in
between shmat and shmctl IPC_RMID.  Shared memory segment cleanup
should happen automagically in most cases, including abnormal process
termination.


Umm... right.  Duh.  I knew that.

Really.

So -- we're good!

Let's open the discussion of making sysv the default on systems that 
support the IPC_RMID behavior (which, AFAIK, is only Linux)...

I'm sorry, but I think System V has many disadvantages over mmap.

1. As discussed before, cleanup is not as easy as for a file. It is a 
good thing to remove the shm segment after creation, but since 
problems often happen during shmget/shmat, there is still a high risk 
of leaving things behind.


2. There are kernel limits you may need to raise (kernel.shmall, 
kernel.shmmax). On most Linux distributions, shmmax is 32MB, which does 
not permit the sysv mechanism to work. Mmapped files have no such limit.


3. Each shm segment is identified by a 32-bit integer. This namespace 
is small (and non-intuitive, as opposed to a file name), and the 
probability of a collision is not zero, especially when you start 
creating multiple shared memory segments (for collectives, one-sided 
operations, ...).


So, I'm a bit reluctant to work with System V mechanisms again. I 
don't think there is a *real* reason for System V to be faster than 
mmap, since it should just be memory. I'd rather find out why mmap is 
slower.


Sylvain


One should not ignore the option of POSIX shared memory: shm_open() and 
shm_unlink().  When present, this mechanism usually does not suffer from 
the small (e.g. 32MB) limits of SysV, and uses a "filename" (in an 
abstract namespace) which can portably be up to 14 characters in length.  
Because shm_unlink() may be called as soon as the final process has done 
its shm_open(), one can get approximately the safety of the IPC_RMID 
mechanism, but without being restricted to Linux.
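
A minimal sketch of that pattern (illustrative only; the name and size are
arbitrary, and older glibc needs -lrt at link time):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* <= 14 chars, starting with '/', for maximum portability */
        const char *name = "/ompi_sm_demo";
        size_t len = 4 << 20;

        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }
        if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

        void *base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);

        /* Once every process has done its shm_open()+mmap(), the name can
         * be unlinked; the memory lives until the last munmap()/exit. */
        shm_unlink(name);

        /* ... use base ... */
        munmap(base, len);
        return 0;
    }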


I have used POSIX shared memory for another project and found it works 
well on Linux, Solaris (10 and Open), FreeBSD and AIX.  That is probably 
narrower coverage than SysV, but still worth consideration IMHO.  With 
mmap(), SysV and POSIX (plus XPMEM on the SGI Altix) as mechanisms for 
sharing memory between processes, I think we have an argument for a 
full-blown "shared pages" framework as opposed to just a "mpi_common_sm" 
MCA parameter.  That brings benefits like possibly "failing over" from 
one component to another (otherwise less desired) one if some limit is 
exceeded.  For instance, SysV could (for a given set of priorities) be 
used by default, but mmap-on-real-fs could be automatically selected when 
the requested/required size exceeds the shmmax value.
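
A rough sketch of the kind of size-based selection Paul describes, assuming
the Linux /proc interface for shmmax; the helper name and component strings
are invented for illustration, not taken from the framework:

    #include <stdio.h>

    /* Pick a backing mechanism for a shared-memory segment of 'want' bytes.
     * Hypothetical selection helper, not actual Open MPI code. */
    static const char *pick_sm_mechanism(unsigned long long want)
    {
        unsigned long long shmmax = 0;
        FILE *f = fopen("/proc/sys/kernel/shmmax", "r");
        if (f != NULL) {
            if (fscanf(f, "%llu", &shmmax) != 1) shmmax = 0;
            fclose(f);
        }
        if (shmmax >= want)
            return "sysv";   /* highest priority and fits the kernel limit */
        return "mmap";       /* fall back to a file-backed mmap */
    }

    int main(void)
    {
        unsigned long long want = 64ULL << 20;   /* e.g. 64 MiB requested */
        printf("selected: %s\n", pick_sm_mechanism(want));
        return 0;
    }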


As for why mmap is slower: when the file is on a real (not tmpfs or other 
ramdisk) filesystem, I am 95% certain that this is an artifact of the 
Linux swapper/pager behavior, which thinks it is being smart by "swapping 
ahead".  Even when there is no memory pressure that requires swapping, 
Linux starts queuing swap I/O for pages to keep the number of "clean" 
pages up when possible. This results in pages of the shared memory file 
being written out to the actual block device.  Both the background I/O 
and the VM metadata updates contribute to the lost time.  I say 95% 
certain because I have a colleague who looked into this phenomenon in 
another setting and I am recounting what he reported as clearly as I can 
remember, but might have misunderstood or inserted my own speculation by 
accident.  A sufficiently motivated investigator (not me) could probably 
devise an experiment to verify this.
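
A simple experiment along those lines (a sketch, not something verified
here): dirty a file-backed mapping once on a real filesystem and once on
tmpfs, and watch block-device write traffic (e.g. with iostat) while the
loop runs; sustained writes with no memory pressure would point at the
swap-ahead behavior described above.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Usage: ./dirty <backing-file>
     * Run once with a file on a real filesystem and once on tmpfs, and
     * watch block-device writes (e.g. "iostat 1") while it loops. */
    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
        size_t len = 64 << 20;
        int fd = open(argv[1], O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, len) != 0) { perror("open/ftruncate"); return 1; }
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        for (int i = 0; i < 100; i++)
            memset(p, i, len);      /* keep re-dirtying the mapped pages */
        munmap(p, len);
        close(fd);
        return 0;
    }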


-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Sylvain Jeaugey

On Thu, 10 Jun 2010, Paul H. Hargrove wrote:

One should not ignore the option of POSIX shared memory: shm_open() and 
shm_unlink().  When present, this mechanism usually does not suffer from 
the small (e.g. 32MB) limits of SysV, and uses a "filename" (in an 
abstract namespace) which can portably be up to 14 characters in length. 
Because shm_unlink() may be called as soon as the final process has done 
its shm_open(), one can get approximately the safety of the IPC_RMID 
mechanism, but without being restricted to Linux.


I have used POSIX shared memory for another project and found it works 
well on Linux, Solaris (10 and Open), FreeBSD and AIX.  That is probably 
narrower coverage than SysV, but still worth consideration IMHO.

I was just doing research on shm_open() to make sure it had no limitations 
before introducing it in this thread. You saved me some time!


With mmap(), SysV and POSIX (plus XPMEM on the SGI Altix) as mechanisms 
for sharing memory between processes, I think we have an argument for a 
full-blown "shared pages" framework as opposed to just a "mpi_common_sm" 
MCA parameter.  That brings all the benefits like possibly "failing 
over" from one component to another (otherwise less desired) one if some 
limit is exceeded.  For instance, SysV could (for a given set of 
priorities) be used by default, but mmap-on-real-fs could be 
automatically selected when the requested/required size exceeds the 
shmmax value.

That would indeed be nice.

As for why mmap is slower: when the file is on a real (not tmpfs or other 
ramdisk) filesystem, I am 95% certain that this is an artifact of the Linux 
swapper/pager behavior, which thinks it is being smart by "swapping ahead".  
Even when there is no memory pressure that requires swapping, Linux starts 
queuing swap I/O for pages to keep the number of "clean" pages up when 
possible. This results in pages of the shared memory file being written out 
to the actual block device.  Both the background I/O and the VM metadata 
updates contribute to the lost time.  I say 95% certain because I have a 
colleague who looked into this phenomenon in another setting and I am 
recounting what he reported as clearly as I can remember, but might have 
misunderstood or inserted my own speculation by accident.  A sufficiently 
motivated investigator (not me) could probably devise an experiment to 
verify this.

Interesting. Do you think this behavior of the Linux kernel would change 
if the file were unlink()ed after attach?


Sylvain


Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Jeff Squyres
On Jun 10, 2010, at 4:43 AM, Paul H. Hargrove wrote:

> One should not ignore the option of POSIX shared memory: shm_open() and
> shm_unlink().  When present, this mechanism usually does not suffer from
> the small (e.g. 32MB) limits of SysV, and uses a "filename" (in an
> abstract namespace) which can portably be up to 14 characters in length. 
> Because shm_unlink() may be called as soon as the final process has done
> its shm_open(), one can get approximately the safety of the IPC_RMID
> mechanism, but without being restricted to Linux.

FWIW, with the infrastructure work that Sam did, it would probably be the work 
of about an hour or two to add shm_open()/etc. into the common sm stuff.

> I have used POSIX shared memory for another project and found it works
> well on Linux, Solaris (10 and Open), FreeBSD and AIX.  That is probably
> narrower coverage than SysV, but still worth consideration IMHO.  With
> mmap(), SysV and POSIX (plus XPMEM on the SGI Altix) as mechanisms for
> sharing memory between processes, I think we have an argument for a
> full-blown "shared pages" framework as opposed to just a "mpi_common_sm"
> MCA parameter.  That brings benefits like possibly "failing over" from
> one component to another (otherwise less desired) one if some limit is
> exceeded.  For instance, SysV could (for a given set of priorities) be
> used by default, but mmap-on-real-fs could be automatically selected
> when the requested/required size exceeds the shmmax value.

That's more-or-less what Sam did.

Sam -- if the shmat stuff fails because the limits are too low, it'll 
(silently) fall back to the mmap module, right?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Jeff Squyres
On Jun 10, 2010, at 4:57 AM, Sylvain Jeaugey wrote:

> > As for why mmap is slower: when the file is on a real (not tmpfs or other
> > ramdisk) filesystem, I am 95% certain that this is an artifact of the
> > Linux swapper/pager behavior, which thinks it is being smart by "swapping
> > ahead".  Even when there is no memory pressure that requires swapping,
> > Linux starts queuing swap I/O for pages to keep the number of "clean"
> > pages up when possible. This results in pages of the shared memory file
> > being written out to the actual block device.  Both the background I/O
> > and the VM metadata updates contribute to the lost time.  I say 95%
> > certain because I have a colleague who looked into this phenomenon in
> > another setting and I am recounting what he reported as clearly as I can
> > remember, but might have misunderstood or inserted my own speculation by
> > accident.  A sufficiently motivated investigator (not me) could probably
> > devise an experiment to verify this.
> Interesting. Do you think this behavior of the Linux kernel would change
> if the file were unlink()ed after attach?

Note that OMPI does unlink the mmap'ed file after attach.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Sylvain Jeaugey

On Thu, 10 Jun 2010, Jeff Squyres wrote:

Sam -- if the shmat stuff fails because the limits are too low, it'll 
(silently) fall back to the mmap module, right?

From my experience, it completely disabled the sm component. Having a nice 
fallback would indeed be a very good thing.

Sylvain


Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-10 Thread Samuel K. Gutierrez

On Jun 10, 2010, at 1:47 AM, Sylvain Jeaugey wrote:


On Wed, 9 Jun 2010, Jeff Squyres wrote:


On Jun 9, 2010, at 3:26 PM, Samuel K. Gutierrez wrote:

System V shared memory cleanup is a concern only if a process dies in 
between shmat and shmctl IPC_RMID.  Shared memory segment cleanup 
should happen automagically in most cases, including abnormal process 
termination.


Umm... right.  Duh.  I knew that.

Really.

So -- we're good!

Let's open the discussion of making sysv the default on systems  
that support the IPC_RMID behavior (which, AFAIK, is only Linux)...

I'm sorry, but I think System V has many disadvantages over mmap.

1. As discussed before, cleanup is not as easy as for a file. It is a 
good thing to remove the shm segment after creation, but since problems 
often happen during shmget/shmat, there is still a high risk of leaving 
things behind.


2. There are kernel limits you may need to raise (kernel.shmall, 
kernel.shmmax).


I agree that this is a disadvantage, but changing shmall and shmmax  
limits is *only* as painful as having a system admin change a few  
settings (okay, it's painful ;-) ).


On most Linux distributions, shmmax is 32MB, which does not permit 
the sysv mechanism to work. Mmapped files have no such limit.


Not necessarily true.  If a user *really* wanted to use sysv and their  
system's shmmax limit was 32MB, they could just add -mca  
mpool_sm_min_size 3355 and everything would work properly.  I do  
understand, however, that this may not be ideal and may have  
performance implications.


Based on this, I'm leaning towards the default behavior that we  
currently have in the trunk:


- sysv disabled by default
- use mmap, unless sysv is explicitly requested by the user
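
For illustration only (exact parameter names and values depend on the Open 
MPI version, and ./ring is just a placeholder application), an explicit 
request for the sysv mechanism would look something like:

    # default behavior: mmap-backed shared memory
    mpirun -np 4 ./ring
    # explicitly request the System V mechanism via the common sm selection
    # parameter mentioned earlier in the thread
    mpirun -np 4 --mca mpi_common_sm sysv ./ring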



3. Each shm segment is identified by a 32-bit integer. This namespace 
is small (and non-intuitive, as opposed to a file name), and the 
probability of a collision is not zero, especially when you start 
creating multiple shared memory segments (for collectives, one-sided 
operations, ...).


I'm not sure if collisions are a problem.  I'm using shmget(IPC_PRIVATE), 
so I'm guessing once I've asked for more than ~2^16 keys, things will 
fail.




So, I'm a bit reluctant to work with System V mechanisms again. I  
don't think there is a *real* reason for System V to be faster than  
mmap, since it should just be memory. I'd rather find out why mmap  
is slower.


Jeff and I talked, and we are going to hack something together that  
uses shm_open and friends and incorporates more sophisticated fallback  
mechanisms if a particular component fails initialization.  Once we  
are done with that work, would you be willing to conduct another  
similar performance study that incorporates all sm mechanisms?


Thanks,

--
Samuel K. Gutierrez
Los Alamos National Laboratory



Sylvain