Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
Hi all,

Does anyone know of a relatively portable way to query a given system for the shmctl behavior that I am relying on, or is this going to be a nightmare? If I am reading this thread correctly, the presence of shmget and Linux is not sufficient to determine an adequate level of sysv support.

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On May 2, 2010, at 7:48 AM, N.M. Maclaren wrote:

> On May 2 2010, Ashley Pittman wrote:
>> On 2 May 2010, at 04:03, Samuel K. Gutierrez wrote:
>>
>> As to performance there should be no difference in use between Sys-V
>> shared memory and file-backed shared memory; the instructions issued
>> and the MMU flags for the page should both be the same, so the
>> performance should be identical.
>
> Not necessarily, and possibly not so even for far-future Linuces. On at
> least one system I used, the poxious kernel wrote the complete file to
> disk before returning - all right, it did that for System V shared
> memory, too, just to a 'hidden' file! But, if I recall, on another it
> did that only for file-backed shared memory - however, it's a decade
> ago now and I may be misremembering. Of course, that's a serious issue
> mainly for large segments. I was using multi-GB ones. I don't know how
> big the ones you need are.
>
>> The one area you do need to keep an eye on for performance is on NUMA
>> machines, where it's important which process on a node touches each
>> page first; you can end up using different areas (pages, not regions)
>> for communicating in different directions between the same pair of
>> processes. I don't believe this is any different to mmap-backed shared
>> memory though.
>
> On some systems it may be, but in bizarre, inconsistent, undocumented
> and unpredictable ways :-( Also, there are usually several system (and
> sometimes user) configuration options that change the behaviour, so
> you have to allow for that.
>
> My experience of trying to use those is that different uses have
> incompatible requirements, and most of the critical configuration
> parameters apply to ALL uses! In my view, the configuration variability
> is the number one nightmare for trying to write portable code that uses
> any form of shared memory. The ARMCI developers seem to agree.
>
>>> Because of this, sysv support may be limited to Linux systems - that
>>> is, until we can get a better sense of which systems provide the
>>> shmctl IPC_RMID behavior that I am relying on.
>
> And, I suggest, whether they have an evil gotcha on one of the areas
> that Ashley Pittman noted.
>
> Regards,
> Nick Maclaren.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
On May 2 2010, Ashley Pittman wrote:
> On 2 May 2010, at 04:03, Samuel K. Gutierrez wrote:
>
> As to performance there should be no difference in use between Sys-V
> shared memory and file-backed shared memory; the instructions issued
> and the MMU flags for the page should both be the same, so the
> performance should be identical.

Not necessarily, and possibly not so even for far-future Linuces. On at least one system I used, the poxious kernel wrote the complete file to disk before returning - all right, it did that for System V shared memory, too, just to a 'hidden' file! But, if I recall, on another it did that only for file-backed shared memory - however, it's a decade ago now and I may be misremembering. Of course, that's a serious issue mainly for large segments. I was using multi-GB ones. I don't know how big the ones you need are.

> The one area you do need to keep an eye on for performance is on NUMA
> machines, where it's important which process on a node touches each
> page first; you can end up using different areas (pages, not regions)
> for communicating in different directions between the same pair of
> processes. I don't believe this is any different to mmap-backed shared
> memory though.

On some systems it may be, but in bizarre, inconsistent, undocumented and unpredictable ways :-( Also, there are usually several system (and sometimes user) configuration options that change the behaviour, so you have to allow for that.

My experience of trying to use those is that different uses have incompatible requirements, and most of the critical configuration parameters apply to ALL uses! In my view, the configuration variability is the number one nightmare for trying to write portable code that uses any form of shared memory. The ARMCI developers seem to agree.

>> Because of this, sysv support may be limited to Linux systems - that
>> is, until we can get a better sense of which systems provide the
>> shmctl IPC_RMID behavior that I am relying on.

And, I suggest, whether they have an evil gotcha on one of the areas that Ashley Pittman noted.

Regards,
Nick Maclaren.
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
On 02/05/10 06:49, Ashley Pittman wrote:
> I think you should look into this a little deeper, it
> certainly used to be the case on Linux that setting
> IPC_RMID would also prevent any further processes from
> attaching to the segment.

That certainly appears to be the case in the current master of the kernel: IPC_PRIVATE is set on the segment with the comment:

  /* Do not find it any more */

That flag means that ipcget() - used by sys_shmget() - takes a different code path and calls ipcget_new() rather than ipcget_public().

cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computational Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
On 01/05/10 23:03, Samuel K. Gutierrez wrote:
> I call shmctl IPC_RMID immediately after one process has
> attached to the segment because, at least on Linux, this
> only marks the segment for destruction.

That's correct. Looking at the kernel code (at least in the current git master), the function that handles this - do_shm_rmid() in ipc/shm.c - only destroys the segment if nobody is attached to it; otherwise it marks the segment as IPC_PRIVATE to stop others finding it, and with SHM_DEST so that it is automatically destroyed on the last detach.

cheers,
Chris
--
Christopher Samuel - Senior Systems Administrator
VLSCI - Victorian Life Sciences Computational Initiative
Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
http://www.vlsci.unimelb.edu.au/
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
On 2 May 2010, at 04:03, Samuel K. Gutierrez wrote:
> As far as I can tell, calling shmctl IPC_RMID is immediately destroying
> the shared memory segment even though there is at least one process
> attached to it. This is interesting and confusing because Solaris 10's
> documented behavior for shmctl IPC_RMID is similar to Linux's.
>
> I call shmctl IPC_RMID immediately after one process has attached to the
> segment because, at least on Linux, this only marks the segment for
> destruction. The segment is only actually destroyed after all attached
> processes have terminated. I'm relying on this behavior for resource
> cleanup upon application termination (normal/abnormal).

I think you should look into this a little deeper; it certainly used to be the case on Linux that setting IPC_RMID would also prevent any further processes from attaching to the segment.

You're right that minimising the window during which the region exists without that bit set is good, both in terms of wall-clock time and lines of code. What we used to do here was have all processes on a node perform an out-of-band intra-node barrier before creating the segment, and another in-band barrier immediately after creating it. Without this, if one process on a node has problems and aborts during startup before it gets to the shared memory code, you are almost guaranteed to leave an unattached segment behind.

As to performance, there should be no difference in use between Sys-V shared memory and file-backed shared memory; the instructions issued and the MMU flags for the page should both be the same, so the performance should be identical. The one area you do need to keep an eye on for performance is on NUMA machines, where it's important which process on a node touches each page first; you can end up using different areas (pages, not regions) for communicating in different directions between the same pair of processes. I don't believe this is any different to mmap-backed shared memory though.

> Because of this, sysv support may be limited to Linux systems - that is,
> until we can get a better sense of which systems provide the shmctl
> IPC_RMID behavior that I am relying on.

Ashley.

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
Hi Ethan,

Sorry about the lag.

As far as I can tell, calling shmctl IPC_RMID is immediately destroying the shared memory segment even though there is at least one process attached to it. This is interesting and confusing because Solaris 10's documented behavior for shmctl IPC_RMID is similar to Linux's.

I call shmctl IPC_RMID immediately after one process has attached to the segment because, at least on Linux, this only marks the segment for destruction. The segment is only actually destroyed after all attached processes have terminated. I'm relying on this behavior for resource cleanup upon application termination (normal/abnormal).

Because of this, sysv support may be limited to Linux systems - that is, until we can get a better sense of which systems provide the shmctl IPC_RMID behavior that I am relying on. Any other ideas are greatly appreciated.

Thanks for testing!

--
Samuel K. Gutierrez
Los Alamos National Laboratory

> On Thu, Apr/29/2010 02:52:24PM, Samuel K. Gutierrez wrote:
>> Hi Ethan,
>>
>> Bummer. What does the following command show?
>>
>> sysctl -a | grep shm
>
> In this case, I think the Solaris equivalent to sysctl is prctl, e.g.,
>
> $ prctl -i project group.staff
> project: 10: group.staff
> NAME                    PRIVILEGE   VALUE   FLAG  ACTION  RECIPIENT
> ...
> project.max-shm-memory  privileged  3.92GB  -     deny    -
>                         system      16.0EB  max   deny    -
> project.max-shm-ids     privileged  128     -     deny    -
>                         system      16.8M   max   deny    -
> ...
>
> Is that the info you need?
>
> -Ethan
>
>> On Apr 29, 2010, at 1:32 PM, Ethan Mallove wrote:
>>> Hi Samuel,
>>>
>>> I'm trying to run off your HG clone, but I'm seeing issues with
>>> c_hello, e.g.,
>>>
>>> $ mpirun -mca mpi_common_sm sysv --mca btl self,sm,tcp --host
>>> burl-ct-v440-2,burl-ct-v440-2 -np 2 ./c_hello
>>> ------------------------------------------------------------------
>>> A system call failed during shared memory initialization that
>>> should not have. It is likely that your MPI job will now either
>>> abort or experience performance degradation.
>>>
>>>   Local host:  burl-ct-v440-2
>>>   System call: shmat(2)
>>>   Process:     [[43408,1],1]
>>>   Error:       Invalid argument (errno 22)
>>> ------------------------------------------------------------------
>>> ^Cmpirun: killing job...
>>>
>>> $ uname -a
>>> SunOS burl-ct-v440-2 5.10 Generic_118833-33 sun4u sparc
>>> SUNW,Sun-Fire-V440
>>>
>>> The same test works okay if I s/sysv/mmap/.
>>>
>>> Regards,
>>> Ethan
>>>
>>> On Wed, Apr/28/2010 07:16:12AM, Samuel K. Gutierrez wrote:
>>>> Hi,
>>>>
>>>> Faster component initialization/finalization times is one of the
>>>> main motivating factors of this work. The general idea is to get
>>>> away from creating a rather large backing file. With respect to
>>>> module bandwidth and latency, mmap and sysv seem to be comparable -
>>>> at least that is what my preliminary tests have shown. As it stands,
>>>> I have not come across a situation where the mmap SM component
>>>> doesn't work or is slower.
>>>>
>>>> Hope that helps,
>>>>
>>>> --
>>>> Samuel K. Gutierrez
>>>> Los Alamos National Laboratory
>>>>
>>>> On Apr 28, 2010, at 5:35 AM, Bogdan Costescu wrote:
>>>>> On Tue, Apr 27, 2010 at 7:55 PM, Samuel K. Gutierrez wrote:
>>>>>> With Jeff and Ralph's help, I have completed a System V shared
>>>>>> memory component for Open MPI.
>>>>>
>>>>> What is the motivation for this work? Are there situations where
>>>>> the mmap based SM component doesn't work or is slow(er)?
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
Hi Ethan,

Bummer. What does the following command show?

sysctl -a | grep shm

Thanks!

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Apr 29, 2010, at 1:32 PM, Ethan Mallove wrote:
> Hi Samuel,
>
> I'm trying to run off your HG clone, but I'm seeing issues with
> c_hello, e.g.,
>
> $ mpirun -mca mpi_common_sm sysv --mca btl self,sm,tcp --host
> burl-ct-v440-2,burl-ct-v440-2 -np 2 ./c_hello
> --------------------------------------------------------------------
> A system call failed during shared memory initialization that should
> not have. It is likely that your MPI job will now either abort or
> experience performance degradation.
>
>   Local host:  burl-ct-v440-2
>   System call: shmat(2)
>   Process:     [[43408,1],1]
>   Error:       Invalid argument (errno 22)
> --------------------------------------------------------------------
> ^Cmpirun: killing job...
>
> $ uname -a
> SunOS burl-ct-v440-2 5.10 Generic_118833-33 sun4u sparc
> SUNW,Sun-Fire-V440
>
> The same test works okay if I s/sysv/mmap/.
>
> Regards,
> Ethan
>
> On Wed, Apr/28/2010 07:16:12AM, Samuel K. Gutierrez wrote:
>> Hi,
>>
>> Faster component initialization/finalization times is one of the main
>> motivating factors of this work. The general idea is to get away from
>> creating a rather large backing file. With respect to module bandwidth
>> and latency, mmap and sysv seem to be comparable - at least that is
>> what my preliminary tests have shown. As it stands, I have not come
>> across a situation where the mmap SM component doesn't work or is
>> slower.
>>
>> Hope that helps,
>>
>> --
>> Samuel K. Gutierrez
>> Los Alamos National Laboratory
>>
>> On Apr 28, 2010, at 5:35 AM, Bogdan Costescu wrote:
>>> On Tue, Apr 27, 2010 at 7:55 PM, Samuel K. Gutierrez wrote:
>>>> With Jeff and Ralph's help, I have completed a System V shared
>>>> memory component for Open MPI.
>>>
>>> What is the motivation for this work? Are there situations where the
>>> mmap based SM component doesn't work or is slow(er)?
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
Hi,

Faster component initialization/finalization times is one of the main motivating factors of this work. The general idea is to get away from creating a rather large backing file. With respect to module bandwidth and latency, mmap and sysv seem to be comparable - at least that is what my preliminary tests have shown. As it stands, I have not come across a situation where the mmap SM component doesn't work or is slower.

Hope that helps,

--
Samuel K. Gutierrez
Los Alamos National Laboratory

On Apr 28, 2010, at 5:35 AM, Bogdan Costescu wrote:
> On Tue, Apr 27, 2010 at 7:55 PM, Samuel K. Gutierrez wrote:
>> With Jeff and Ralph's help, I have completed a System V shared memory
>> component for Open MPI.
>
> What is the motivation for this work? Are there situations where the
> mmap based SM component doesn't work or is slow(er)?
>
> Kind regards,
> Bogdan
Re: [OMPI devel] System V Shared Memory for Open MPI: Request for Community Input and Testing
On Tue, Apr 27, 2010 at 7:55 PM, Samuel K. Gutierrez wrote:
> With Jeff and Ralph's help, I have completed a System V shared memory
> component for Open MPI.

What is the motivation for this work? Are there situations where the mmap based SM component doesn't work or is slow(er)?

Kind regards,
Bogdan