Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

On 10/06/10 18:43, Paul H. Hargrove wrote:
> When the file is on a real (not tmpfs or other ramdisk) I am 95% certain
> that this is an artifact of the Linux swapper/pager behavior which is
> thinking it is being smart by "swapping ahead". Even when there is no
> memory pressure that requires swapping, Linux starts queuing swap I/O
> for pages to keep the number of "clean" pages up when possible.

I believe you can tweak that behaviour through the VM subsystem using
/proc/sys/vm/swappiness. It defaults to 60, but lower values are meant to
make the kernel less likely to swap out applications and instead to
concentrate on reclaiming pages from the page cache.

cheers,
Chris

--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computational Initiative
Email: sam...@unimelb.edu.au  Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
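For reference, the current value of this tunable can be read from user space. A minimal C sketch (Linux-specific, nothing Open MPI specific) that only reports the setting; changing it requires root privileges, e.g. via sysctl:

    /* Minimal sketch: report the current vm.swappiness value on Linux.
     * This only reads the tunable; it does not change it. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/swappiness", "r");
        int swappiness;

        if (f == NULL) {
            perror("fopen(/proc/sys/vm/swappiness)");
            return 1;
        }
        if (fscanf(f, "%d", &swappiness) == 1) {
            printf("vm.swappiness = %d\n", swappiness);
        }
        fclose(f);
        return 0;
    }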
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

Chris,

I think that "reclaiming pages from the page cache" is the PROBLEM, not the
solution. If I understand you correctly, a lower value of "swappiness" means
that the ANONYMOUS pages of an application's stack and heap are less likely
to be subject to swap I/O. However, the concern here is for the pages of an
mmap()ed file (though an unlinked one). So, my expectation is that the page
cache is their "owner" rather than the application. If that is an incorrect
understanding, I would appreciate being corrected.

-Paul

Christopher Samuel wrote:
> I believe you can tweak that behaviour through the VM subsystem using
> /proc/sys/vm/swappiness. It defaults to 60, but lower values are meant to
> make the kernel less likely to swap out applications and instead to
> concentrate on reclaiming pages from the page cache.

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

Sylvain Jeaugey wrote:
> On Thu, 10 Jun 2010, Paul H. Hargrove wrote:
> [snip]
>> As for why mmap is slower: when the file is on a real (not tmpfs or other
>> ramdisk) filesystem, I am 95% certain that this is an artifact of the
>> Linux swapper/pager behavior which is thinking it is being smart by
>> "swapping ahead". Even when there is no memory pressure that requires
>> swapping, Linux starts queuing swap I/O for pages to keep the number of
>> "clean" pages up when possible. This results in pages of the shared
>> memory file being written out to the actual block device. Both the
>> background I/O and the VM metadata updates contribute to the lost time.
>>
>> I say 95% certain because I have a colleague who looked into this
>> phenomenon in another setting and I am recounting what he reported as
>> clearly as I can remember, but I might have misunderstood or inserted my
>> own speculation by accident. A sufficiently motivated investigator (not
>> me) could probably devise an experiment to verify this.
>
> Interesting. Do you think this behavior of the Linux kernel would change
> if the file was unlink()ed after attach?
>
> Sylvain

As Jeff pointed out, the file IS unlinked by Open MPI, presumably to ensure
it is not left behind in case of abnormal termination. This was also the
case for the scenario I reported my colleague looking at. We were
(unpleasantly) surprised to find that this "swap ahead" behavior was being
applied to an unlinked file: a case that would appear to be a very simple
one to optimize away. However, the simple fact is that Linux appears to
queue I/O to the "backing store" for a page regardless of little details
like the file being unlinked.

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
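A minimal sketch of the pattern being discussed: create a regular file, size it, mmap() it as a shared segment, and unlink() it once attached. The path and size below are made up for illustration and are not what Open MPI actually uses. On a real (non-tmpfs) filesystem, the kernel can still write dirty pages of this mapping back to the block device even though the file no longer has a name:

    /* Sketch: mmap a regular file as a shared-memory segment, then unlink it.
     * Filename and size are illustrative only.
     * Even after the unlink, the kernel may still write dirty pages of the
     * mapping back to the (now nameless) file on a real block device. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const char  *path = "/tmp/shmem_backing_file";   /* hypothetical path */
        const size_t size = 4 * 1024 * 1024;             /* 4 MB segment */

        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("open"); return 1; }

        if (ftruncate(fd, (off_t) size) != 0) { perror("ftruncate"); return 1; }

        void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) { perror("mmap"); return 1; }

        /* Once every process has attached, the name can go away; the mapping
         * (and its backing store) lives on until munmap()/exit. */
        unlink(path);
        close(fd);

        memset(seg, 0, size);     /* dirty the pages */
        munmap(seg, size);
        return 0;
    }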
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

On Jun 11, 2010, at 5:43 AM, Paul H. Hargrove wrote:
>> Interesting. Do you think this behavior of the Linux kernel would
>> change if the file was unlink()ed after attach?
>
> As Jeff pointed out, the file IS unlinked by Open MPI, presumably to
> ensure it is not left behind in case of abnormal termination.

I have to admit that I lied. :-(

Sam and I were talking on the phone yesterday about the shm_open() stuff
and, to my chagrin, I discovered that the mmap'ed files are *not* unlinked
in OMPI until MPI_FINALIZE. I'm not actually sure why; I could have sworn
that we unlinked them after everyone mmap'ed them...

Regardless, Sam and I made good progress on the shm_open() stuff yesterday.
I believe that Sam is looking for the right place to put the shm_unlink()
step so that we *don't* leave it around like we do with the mmap files. I
have a few more steps to do to add in the right silent-failover stuff, but
we'll probably have something for Sylvain to test soon (final polish may be
delayed a little because I'm on travel to the MPI Forum next week).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
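For comparison, a minimal sketch of the POSIX shm_open() approach, with shm_unlink() called once the segment is attached. The segment name here is made up, and older glibc needs -lrt at link time. On Linux the object typically lives on tmpfs (/dev/shm), so there is no regular file on a block device behind it, though tmpfs pages can still be swapped under memory pressure:

    /* Sketch: POSIX shared memory via shm_open(), unlinked after attach.
     * Segment name is illustrative only; link with -lrt on older glibc.
     * The object is tmpfs-backed, so no regular file sits behind it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char  *name = "/ompi_test_segment";   /* hypothetical name */
        const size_t size = 4 * 1024 * 1024;

        int fd = shm_open(name, O_RDWR | O_CREAT, 0600);
        if (fd < 0) { perror("shm_open"); return 1; }

        if (ftruncate(fd, (off_t) size) != 0) { perror("ftruncate"); return 1; }

        void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) { perror("mmap"); return 1; }

        /* After all processes have attached, remove the name so nothing is
         * left behind on abnormal termination. */
        shm_unlink(name);
        close(fd);

        /* ... use the segment ... */
        munmap(seg, size);
        return 0;
    }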
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

On Fri, 11 Jun 2010, Jeff Squyres wrote:
> On Jun 11, 2010, at 5:43 AM, Paul H. Hargrove wrote:
>> Interesting. Do you think this behavior of the Linux kernel would change
>> if the file was unlink()ed after attach?

After a little talk with kernel guys, it seems that unlinking wouldn't
change anything performance-wise (it would just prevent cleanup issues).

Sylvain
[OMPI devel] hwloc
Just FYI: we fixed some Solaris issues in the hwloc paffinity component the
other day; it appears to be working properly on all platforms now. We'll let
it soak a little longer, but I think we're looking good for the first step
of removing all other paffinity components and leaving just hwloc and test.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

On Jun 11, 2010, at 5:10 AM, Jeff Squyres wrote:
> On Jun 11, 2010, at 5:43 AM, Paul H. Hargrove wrote:
>>> Interesting. Do you think this behavior of the Linux kernel would
>>> change if the file was unlink()ed after attach?
>>
>> As Jeff pointed out, the file IS unlinked by Open MPI, presumably to
>> ensure it is not left behind in case of abnormal termination.
>
> I have to admit that I lied. :-(
>
> Sam and I were talking on the phone yesterday about the shm_open() stuff
> and, to my chagrin, I discovered that the mmap'ed files are *not* unlinked
> in OMPI until MPI_FINALIZE. I'm not actually sure why; I could have sworn
> that we unlinked them after everyone mmap'ed them...

The idea was one large memory segment for all processes, and it wasn't
unlinked after complete attach so that spawned procs could also use shmem
(which never worked, of course). So I think we could unlink during init at
this point.

Brian

--
Brian W. Barrett
Dept. 1423: Scalable System Software
Sandia National Laboratories
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing

On Jun 11, 2010, at 12:53 PM, Barrett, Brian W wrote:
> The idea was one large memory segment for all processes, and it wasn't
> unlinked after complete attach so that spawned procs could also use shmem
> (which never worked, of course). So I think we could unlink during init at
> this point.

I could have sworn that we decided that long ago and added the unlink.
Probably we *did* reach that conclusion long ago, but never actually got
around to adding the unlink. Sam and I are still in that code area now; we
might as well add the unlink while we're there.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/