Re: [OMPI devel] SM backing file size

Jeff Squyres Fri, 14 Nov 2008 09:22:40 -0500

Ok. Should be pretty easy to test/simulate to figure out what's going on -- e.g., whether it's segv'ing in MPI_INIT or the first MPI_SEND.


On Nov 14, 2008, at 9:19 AM, Ralph Castain wrote:

Until we do complete the switch, and for systems that do not support the alternate type of shared memory (I believe it is only Linux?), I truly believe we should do something nicer than segv.
Just to clarify: I know the segv case was done with paffinity set, and believe both cases were done that way. In the first case, I was told that the segv hit when they did MPI_Send, but I did not personally verify that claim - it could be that it hit during maffinity binding if, as you suggest, we actually touch the page at that time.
Ralph



On Nov 14, 2008, at 7:07 AM, Jeff Squyres wrote:
It's been a looooong time since I've looked at the sm code; Eugene has looked at it much more in-depth recently than I have. But I'm guessing we *haven't* checked this stuff to abort nicely in such error conditions. We might very well succeed in the mmap but then segv later when the memory isn't actually available. Perhaps we should try to touch every page first to ensure that it's actually there...? (I'm pretty sure we do this when using paffinity to ensure to maffinity bind memory to processors -- perhaps we're not doing that in the !paffinity case?)
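As a rough illustration of "abort nicely", here is a minimal Python sketch of a cheaper, related check: compare the free space in the backing directory against the requested size before the file is ever mapped. This is not the touch-every-page approach and not what the sm code does; the function name and the idea of calling it before the mmap are assumptions made for the example.

```python
import os


def has_room_for_backing_file(directory, needed_bytes):
    """Return True if `directory` reports at least `needed_bytes` free.

    Hypothetical helper, not part of Open MPI. It only inspects the
    filesystem counters, so another process could still consume the
    space between this check and the actual mmap.
    """
    st = os.statvfs(directory)
    # f_bavail: blocks available to unprivileged users; f_frsize: block size.
    free_bytes = st.f_bavail * st.f_frsize
    return free_bytes >= needed_bytes
```

A caller could print a warning and disable sm (or abort) when this returns False, instead of running into a segv on first use.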
Additionally, another solution might well be what Tim has long advocated: switch to the other type of shared memory on systems that support auto-pruning it when all processes die, and/or have the orted kill it when all processes die. Then a) we're not dependent on the filesystem free space, and b) we're not writing all the dirty pages to disk when the processes exit.
On Nov 14, 2008, at 8:42 AM, Ralph Castain wrote:
Hi Eugene
I too am interested - I think we need to do something about the sm backing file situation as larger core machines are slated to become more prevalent shortly.
I appreciate your info on the sizes and controls. One other question: what happens when there isn't enough memory to support all this? Are we smart enough to detect this situation? Does the sm subsystem quietly shut down? Warn and shut down? Segfault?
I have two examples so far:
1. using a ramdisk, /tmp was set to 10MB. OMPI was run on a single node, 2ppn, with btl=openib,sm,self. The program started, but segfaulted on the first MPI_Send. No warnings were printed.
2. again with a ramdisk, /tmp was reportedly set to 16MB (unverified - some uncertainty, could have been much larger). OMPI was run on multiple nodes, 16ppn, with btl=openib,sm,self. The program ran to completion without errors or warning. I don't know the communication pattern - could be no local comm was performed, though that sounds doubtful.
If someone doesn't know, I'll have to dig into the code and figure out the response - just hoping that someone can spare me the pain.
Thanks
Ralph


On Nov 13, 2008, at 3:21 PM, Eugene Loh wrote:
Ralph Castain wrote:
As has frequently been commented upon at one time or another, the shared memory backing file can be quite huge. There used to be a param for controlling this size, but I can't find it in 1.3 - or at least, the name or method for controlling file size has morphed into something I don't recognize.
Can someone more familiar with that subsystem point me to one or more params that will allow us to control the size of that file? It is swamping our systems and causing OMPI to segfault.
Sounds like you've already gotten your answers, but I'll add my $0.02 anyhow.
The file size is the number of local processes (call it n) times mpool_sm_per_peer_size (default 32M), but with a minimum of mpool_sm_min_size (default 128M) and a maximum of mpool_sm_max_size (default 2G? 256M?). So, you can tweak those parameters to control file size.
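The sizing rule just described can be sketched in a few lines of Python. This is only an illustration of the clamp, not the mpool code itself: the function name is made up, "M" and "G" are assumed to mean MiB and GiB, and since the default maximum is uncertain (2G or 256M), the 2 GiB default here is an assumption that can be overridden.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB


def sm_backing_file_size(n_local_procs,
                         per_peer_size=32 * MIB,   # mpool_sm_per_peer_size
                         min_size=128 * MIB,       # mpool_sm_min_size
                         max_size=2 * GIB):        # mpool_sm_max_size (assumed default)
    """Estimate the backing file size: n * per_peer, clamped to [min, max]."""
    if n_local_procs < 1:
        raise ValueError("need at least one local process")
    requested = n_local_procs * per_peer_size
    return max(min_size, min(requested, max_size))
```

Under these assumptions, 2 local processes give 128 MiB (the minimum wins), 16 give 512 MiB, and 128 hit the 2 GiB cap. Any of the three values should be settable at run time as ordinary MCA parameters, e.g. with mpirun's --mca option.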
Another issue is possibly how small a backing file you can get away with. That is, just forcing the file to be smaller may not be enough since your job may no longer run. The backing file seems to be used mainly by:
*) eager-fragment free lists: We start with enough eager fragments so that we could have two per connection. So, you could bump the sm eager size down if you need to shoehorn a job into a very small backing file.
*) large-fragment free lists: We start with 8*n large fragments. If this term plagues you, you can bump the sm chunk size down or reduce the value of 8 (using btl_sm_free_list_num, I think).
*) FIFOs: The code tries to align a number of things on pagesize boundaries, so you end up with about 3*n*n*pagesize overhead here. If this term is causing you problems, you're stuck (unless you modify OMPI).
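To get a feel for how the three terms above scale with n, here is a back-of-the-envelope Python sketch. It is an estimate under stated assumptions, not a model of the actual allocator: the function name is hypothetical, the eager and chunk sizes must be supplied by the caller (no defaults are given above), the number of connections is assumed to be n*(n-1) unless passed in, and per-fragment headers and alignment padding are ignored.

```python
def estimate_sm_usage(n, eager_size, chunk_size,
                      page_size=4096, free_list_num=8, connections=None):
    """Rough lower bound (bytes) on what the sm BTL places in the backing file."""
    if connections is None:
        # Assumption: one connection per ordered pair of local processes.
        connections = n * (n - 1)
    eager = 2 * connections * eager_size      # two eager fragments per connection
    large = free_list_num * n * chunk_size    # 8*n large fragments by default
    fifo = 3 * n * n * page_size              # page-aligned FIFO overhead
    return {"eager": eager, "large": large, "fifo": fifo,
            "total": eager + large + fifo}
```

With purely illustrative values, estimate_sm_usage(16, eager_size=4096, chunk_size=32768) puts the FIFO term at 3 MiB and the large-fragment term at 4 MiB. The FIFO term grows quadratically in n, which is why it becomes the sticking point on nodes with many cores.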
I'm interested in this subject!  :^)
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems




