Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-11 Thread Christopher Samuel

On 10/06/10 18:43, Paul H. Hargrove wrote:

> When the file is on a real (not tmpfs or other ramdisk) I am 95% certain
> that this is an artifact of the Linux swapper/pager behavior which is
> thinking it is being smart by "swapping ahead".  Even when there is no
> memory pressure that requires swapping, Linux starts queuing swap I/O
> for pages to keep the number of "clean" pages up when possible.

I believe you can tweak that behaviour through the VM subsystem
using /proc/sys/vm/swappiness; it defaults to 60, but lower values
are meant to make the kernel less likely to swap out applications
and instead to concentrate on reclaiming pages from the page cache.
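
As a minimal (untested) sketch of poking at it from C: the procfs
path is standard Linux, but the value 10 below is only an example,
and the write requires root:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/vm/swappiness", "r");
    int cur = -1;

    if (f != NULL) {
        if (fscanf(f, "%d", &cur) != 1)
            cur = -1;
        fclose(f);
    }
    printf("current vm.swappiness = %d\n", cur);

    /* Lowering it needs root (CAP_SYS_ADMIN); 10 is only an
     * example value -- the right setting is site policy. */
    f = fopen("/proc/sys/vm/swappiness", "w");
    if (f != NULL) {
        fprintf(f, "10\n");
        fclose(f);
    }
    return 0;
}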

cheers,
Chris
-- 
 Christopher Samuel - Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computational Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.unimelb.edu.au/


Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-11 Thread Paul H. Hargrove

Chris,

I think that "reclaiming pages from the page cache" is the PROBLEM,
not the solution.  If I understand you correctly, a lower value of
"swappiness" means that the ANONYMOUS pages of an application's stack
and heap are less likely to be subject to swap I/O.  However, the
concern here is for the pages of an mmap()ed file (albeit an unlinked
one), so my expectation is that the page cache, rather than the
application, is their "owner".  If that understanding is incorrect, I
would appreciate being corrected.
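
To make the distinction concrete, here is a hypothetical C sketch (not
Open MPI code) of the two kinds of mappings: anonymous pages, whose
dirty contents can only be written to swap, versus file-backed
MAP_SHARED pages, which live in the page cache with the file itself as
their backing store:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t len = 1 << 20;

    /* Anonymous memory: behaves like heap/stack; under memory
     * pressure its dirty pages can only go to swap. */
    void *anon = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* File-backed shared mapping: its pages are page-cache pages,
     * and writeback targets the file on disk.  (Made-up path.) */
    int fd = open("/tmp/shm_example", O_RDWR | O_CREAT, 0600);
    void *shared = MAP_FAILED;
    if (fd >= 0 && ftruncate(fd, (off_t)len) == 0)
        shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);

    if (anon != MAP_FAILED)   munmap(anon, len);
    if (shared != MAP_FAILED) munmap(shared, len);
    if (fd >= 0) { close(fd); unlink("/tmp/shm_example"); }
    return 0;
}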


-Paul



--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-11 Thread Paul H. Hargrove


Sylvain Jeaugey wrote:
> On Thu, 10 Jun 2010, Paul H. Hargrove wrote:
>
> [snip]
>
>> As for why mmap is slower.  When the file is on a real (not tmpfs or
>> other ramdisk) I am 95% certain that this is an artifact of the Linux
>> swapper/pager behavior which is thinking it is being smart by
>> "swapping ahead".  Even when there is no memory pressure that
>> requires swapping, Linux starts queuing swap I/O for pages to keep
>> the number of "clean" pages up when possible.  This results in pages
>> of the shared memory file being written out to the actual block
>> device.  Both the background I/O and the VM metadata updates
>> contribute to the lost time.  I say 95% certain because I have a
>> colleague who looked into this phenomenon in another setting and I am
>> recounting what he reported as clearly as I can remember, but might
>> have misunderstood or inserted my own speculation by accident.  A
>> sufficiently motivated investigator (not me) could probably devise an
>> experiment to verify this.
>
> Interesting. Do you think this behavior of the linux kernel would
> change if the file was unlink()ed after attach?
>
> Sylvain



As Jeff pointed out, the file IS unlinked by Open MPI, presumably to
ensure it is not left behind in case of abnormal termination.

This was also the case in the scenario I reported my colleague looking
at.  We were (unpleasantly) surprised to find that this "swap ahead"
behavior was being applied to an unlinked file: a case that would
appear to be a very simple one to optimize away.  However, the simple
fact is that Linux appears to queue I/O to the "backing store" for a
page regardless of little details like the file being unlinked.
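
For illustration, the pattern under discussion looks roughly like the
hypothetical sketch below (made-up path; not the actual Open MPI code).
The unlink() guarantees cleanup on abnormal exit, but per the above it
does NOT stop the kernel from treating the file as backing store:

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/ompi_seg_example";  /* made-up name */
    size_t len = 1 << 20;

    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, (off_t)len) != 0) { perror("ftruncate"); return 1; }

    void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }

    /* Once every process has attached, the name can be removed; the
     * mapping stays valid until the last munmap()/exit, but the
     * (unlinked) file is still the page cache's backing store. */
    unlink(path);
    close(fd);

    /* ... use the segment ... */
    munmap(seg, len);
    return 0;
}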


-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900



Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-11 Thread Jeff Squyres
On Jun 11, 2010, at 5:43 AM, Paul H. Hargrove wrote:

> > Interesting. Do you think this behavior of the linux kernel would
> > change if the file was unlink()ed after attach ?
> 
> As Jeff pointed out, the file IS unlinked by Open MPI, presumably to
> ensure it is not left behind in case of abnormal termination.

I have to admit that I lied.  :-(

Sam and I were talking on the phone yesterday about the shm_open() stuff and to 
my chagrin, I discovered that the mmap'ed files are *not* unlinked in OMPI 
until MPI_FINALIZE.  I'm not actually sure why; I could have sworn that we 
unlinked them after everyone mmap'ed them...

Regardless, Sam and I made good progress on the shm_open() stuff yesterday.  We 
should have something for Sylvain to test soon.  I believe that Sam is looking 
for the right place to put the shm_unlink() step so that we *don't* leave it 
around like we do with the mmap files.  I have a few more steps to do to add in 
the right silent-failover stuff, but we'll probably have something for Sylvain 
to test soon (final polish may be delayed a little because I'm on travel to the 
MPI Forum next week).
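
For the curious, the POSIX variant we're wiring up has roughly the
following shape (an illustrative sketch only -- the real component
code, naming, and failover logic are still in flux; link with -lrt on
older glibc):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *name = "/ompi_seg_example";  /* made-up name */
    size_t len = 1 << 20;

    int fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, (off_t)len) != 0) { perror("ftruncate"); return 1; }

    void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);

    /* The step Sam is placing: once all peers have attached,
     * remove the name so nothing is left behind on abnormal exit. */
    shm_unlink(name);

    munmap(seg, len);
    return 0;
}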

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-11 Thread Sylvain Jeaugey

On Fri, 11 Jun 2010, Jeff Squyres wrote:


> On Jun 11, 2010, at 5:43 AM, Paul H. Hargrove wrote:
>
>>> Interesting. Do you think this behavior of the linux kernel would
>>> change if the file was unlink()ed after attach?

After a little talk with kernel guys, it seems that unlinking wouldn't
change performance at all (it would just prevent cleanup issues).


Sylvain


[OMPI devel] hwloc

2010-06-11 Thread Jeff Squyres
Just FYI: We fixed some Solaris issues in the hwloc paffinity the other day; it 
appears to be working properly on all platforms now.  We'll let it soak a 
little longer, but I think we're looking good for the first step of removing 
all other paffinity components and just leaving hwloc and test.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-11 Thread Barrett, Brian W
On Jun 11, 2010, at 5:10 AM, Jeff Squyres wrote:

> On Jun 11, 2010, at 5:43 AM, Paul H. Hargrove wrote:
> 
>>> Interesting. Do you think this behavior of the linux kernel would
>>> change if the file was unlink()ed after attach ?
>> 
>> As Jeff pointed out, the file IS unlinked by Open MPI, presumably to
>> ensure it is not left behind in case of abnormal termination.
> 
> I have to admit that I lied.  :-(
> 
> Sam and I were talking on the phone yesterday about the shm_open() stuff and 
> to my chagrin, I discovered that the mmap'ed files are *not* unlinked in OMPI 
> until MPI_FINALIZE.  I'm not actually sure why; I could have sworn that we 
> unlinked them after everyone mmap'ed them...

The idea was to have one large memory segment for all processes, and it wasn't 
unlinked after everyone attached so that spawned procs could also use shmem 
(which never worked, of course).  So I think we could unlink during init at 
this point.
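
(For context: a late-arriving spawned proc could only attach while the
name still existed -- roughly the hypothetical sketch below, with a
made-up path; after unlink() there is no name left to open.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical late-attaching (e.g. spawned) process: it can
     * only find the segment while the name is still linked. */
    const char *path = "/tmp/ompi_seg_example";  /* made-up name */
    size_t len = 1 << 20;

    int fd = open(path, O_RDWR);      /* fails after unlink() */
    if (fd < 0) { perror("open"); return 1; }

    void *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (seg == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);

    /* ... use the segment ... */
    munmap(seg, len);
    return 0;
}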

Brian

--
  Brian W. Barrett
  Dept. 1423: Scalable System Software
  Sandia National Laboratories







Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

2010-06-11 Thread Jeff Squyres
On Jun 11, 2010, at 12:53 PM, Barrett, Brian W wrote:

> The idea was one large memory segment for all processes and it wasn't 
> unlinked after complete attach so that we could have spawned procs also use 
> shmem (which never worked, of course).  So I think we could unlink during 
> init at this point..

I could have sworn that we decided that long ago and added the unlink.

Probably we *did* reach that conclusion long ago, but never actually got around 
to adding the unlink.  Sam and I are still in that code area now; we might as 
well add the unlink while we're there.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/