Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto
On Thu, May 17, 2007 at 10:20:35AM -0600, Brian Barrett wrote:
> On the other hand, since the MPI standard explicitly says you're not
> allowed to call fork() or system() during the MPI application, and
> since the network should really cope with this in some way, if it
> further complicates the code *at all*, I'm strongly against it.
> Especially since it won't really solve the problem. For example,
> with one-sided, I'm not going to go out of my way to send the first
> and last bit of the buffer so the user can touch those pages while
> calling fork.
>
> Also, if I understand the leave_pinned protocol, this still won't
> really solve anything for the general case -- leave_pinned won't send
> any data eagerly if the buffer is already pinned, so there are still
> going to be situations where the user can cause problems. Now we
> have a situation where sometimes it works and sometimes it doesn't,
> and we pretend to support fork()/system() in certain cases. Seems
> like actually fixing the problem the "right way" would be the right
> path forward...

This will not solve all the problems; it just slightly decreases the
chance that a program gets a SIGSEGV. We are not going to pretend that
we support fork() or system(). Obviously this change will not help the
one-sided, leave_pinned, or leave_pinned_pipeline cases.

Regarding the "complicating the code" issue: I am working on solving a
deadlock in the pipeline protocol, and for that I need the capability
to send any part of a message by copy in/out. The change I propose will
be trivial to do on top of that. The code will be more complex because
of the deadlock issues, not because of the change we are discussing
now :)

> Brian
>
> On May 17, 2007, at 10:10 AM, Jeff Squyres wrote:
>
> > Moving to devel; this question seems worthwhile to push out to the
> > general development community.
> >
> > I've been coming across an increasing number of customers and other
> > random OMPI users who use system(). So if there's zero impact on
> > performance and it doesn't make the code [more] incredibly horrible
> > [than it already is], I'm in favor of this change.
> >
> > On May 17, 2007, at 7:00 AM, Gleb Natapov wrote:
> >
> >> Hi,
> >>
> >> I thought about changing the pipeline protocol to send data from
> >> the end of the message instead of the middle like it does now. The
> >> rationale behind this is better fork() support. When an application
> >> forks, the child doesn't inherit registered memory, so IB providers
> >> educate users not to touch, in the child process, buffers that were
> >> owned by MPI before the fork. The problem is that the granularity
> >> of registration is a HW page (4K), so the last page of the buffer
> >> may also contain other application data, and the user may be
> >> unaware of this and be very surprised by a SIGSEGV. If the pipeline
> >> protocol sends data from the end of a buffer, then the last page of
> >> the buffer will not be registered (and the first page is never
> >> registered because we send the beginning of the buffer eagerly with
> >> the rendezvous packet), so this situation will be avoided. It
> >> should have zero impact on performance. What do you think? How
> >> common is it for MPI applications to fork()?
> >>
> >> --
> >> Gleb.
--
Gleb.
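For readers who have not hit this before, the following fragment is a
purely illustrative sketch (not Open MPI code, and the buffer layout is
hypothetical) of the page-granularity problem described above:
registration covers whole hardware pages, so an unaligned buffer drags
the remainder of its last page into the pinned region, and after fork()
that entire page -- including unrelated data -- is gone in the child.

    /* Illustrative sketch of registration page granularity (not OMPI code).
     * A buffer that does not end on a page boundary pins the rest of its
     * last page as well; after fork() the whole page is absent in the
     * child, including data the application never handed to MPI. */
    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SIZE 4096UL

    int main(void)
    {
        static char data[8192];
        char  *buf = data + 100;   /* hypothetical unaligned user buffer */
        size_t len = 5000;

        uintptr_t first_page = (uintptr_t)buf & ~(PAGE_SIZE - 1);
        uintptr_t last_page  = ((uintptr_t)buf + len - 1) & ~(PAGE_SIZE - 1);
        size_t    tail       = last_page + PAGE_SIZE - ((uintptr_t)buf + len);

        printf("pinned range: [%#lx, %#lx)\n",
               (unsigned long)first_page,
               (unsigned long)(last_page + PAGE_SIZE));
        printf("bytes of the last page that do not belong to the buffer: %zu\n",
               tail);
        return 0;
    }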
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto
On Thu, May 17, 2007 at 12:30:51PM -0400, Patrick Geoffray wrote:
> Jeff Squyres wrote:
> > Moving to devel; this question seems worthwhile to push out to the
> > general development community.
> >
> > I've been coming across an increasing number of customers and other
> > random OMPI users who use system(). So if there's zero impact on
> > performance and it doesn't make the code [more] incredibly horrible
> > [than it already is], I'm in favor of this change.
>
> I will sound like a broken record, but this is the type of thing that
> an MPI implementation should not care about. At least not in the
> (common) protocol layer. That's why the BTL-level abstraction is a bad
> one: device-specific problems bubble up instead of staying hidden in
> device-specific code.

I am glad I provided one more point in favor of your argument :)

> > On May 17, 2007, at 7:00 AM, Gleb Natapov wrote:
> >> problem is that granularity of registration is HW page (4K), so last
>
> What about huge pages?

I'll say it again: I am not trying to solve all the problems of
interconnects that were designed by people who ignored 30 or so years
of OS design evolution. Huge page usage is not transparent in Linux. If
a programmer decides to use huge pages, he should understand the
consequences.

> >> page of the buffer may contain also other application's data and user
> >> may be unaware of this and be very surprised by SIGSEGV. If pipeline
>
> How can a process get a segmentation fault by accessing a page mapped
> in its own address space?

In the child process, VMAs that were registered in the parent process
are no longer mapped.

> >> so this situation will be avoided. It should have zero impact on
> >> performance. What do you think? How common for MPI applications to
> >> fork()?
>
> The only safe way to support fork() with pinned pages is to force the
> duplication of pages at fork time. It makes fork much more expensive,
> but fork should not be in the critical path of HPC applications anyway.

It also increases the chance of fork() failure. Otherwise I agree with
you. I once even started to write a patch to duplicate only the first
and last pages of a pinned region. The chances of such a patch being
accepted into Linux are less than zero, though.

> Playing with registration cache is playing with fire.

The change I propose will not solve any problems if the registration
cache is in use.

--
Gleb.
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto
On Thu, May 17, 2007 at 02:35:02PM -0400, Patrick Geoffray wrote:
> Brian Barrett wrote:
> > On the other hand, since the MPI standard explicitly says you're not
> > allowed to call fork() or system() during the MPI application and
>
> Does it? The MPI spec says that you should not access buffers that
> have been committed to MPI (a pending asynchronous send or recv
> buffer, for example). It does not care about page boundaries and
> pinning side effects.

That is exactly what I am trying to achieve with the proposed change.
The child will not be able to touch memory that was committed to MPI at
the time of fork(), but all other memory will be safe. This is not the
case currently with IB (even when the registration cache is _not_ in
use).

> The fork() problem is due to memory registration, aggravated by the
> registration cache. Memory registration in itself is a hack from the
> OS point of view, and you already know a lot about the various
> problems related to registration cache.
>
> The right way to fix the fork problem is to fix the memory
> registration problem in the OS itself. It's not going to happen
> anytime soon, so it requires another hack (forcing VM duplication of
> registered pages at fork time).

--
Gleb.
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto
On Thu, May 17, 2007 at 02:57:22PM -0400, Patrick Geoffray wrote:
> gshipman wrote:
> >> The fork() problem is due to memory registration aggravated by
> >> registration cache. Memory registration in itself is a hack from
> >> the OS point of view, and you already know a lot about the various
> >> problems related to registration cache.
> >
> > So Gleb is indicating that this is a problem in the pipeline protocol
> > which does not use a registration cache. I think the registration
> > cache, while increasing the probability of badness after fork, is not
> > the culprit.
>
> Indeed, it makes things worse by extending the vulnerability outside
> the time frame of an asynchronous communication. Without the
> registration cache, the bad case is limited to a process that forks
> while a com is pending and touches the same pages before they are
> read/written by the hardware. This is not very likely because the
> window of time is very small, but still possible. However, it is not
> limited to the last partial page of the buffer; it can happen for any
> pinned page.

Now I see that you don't fully understand all of the IB ugliness, so
let me explain it. In IB, QPs and CQs also use registered memory that
is directly written/read by the hardware (to signal a completion or to
get the next work request). After fork() the parent of course continues
to use IB and most definitely touches QP/CQ memory, and at that very
moment everything breaks. So to overcome this problem (and to allow an
IB program to fork() at all), a new madvise flag was implemented that
allows userspace to mark certain memory so that it is not copied to a
child process. This memory is not mapped in the child at all; not even
a VMA is created for it. In the parent this memory is not marked COW.
All memory that is registered by IB is marked in this way. So the
problem is that if a non-aligned buffer is committed to MPI, it may
share a page with some data that the child may want to use, but that
data will not be present in the child.

--
Gleb.
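The madvise flag Gleb describes is Linux's MADV_DONTFORK, which was
added for exactly this RDMA fork problem. The fragment below is a
minimal sketch of the mechanism, not the actual libibverbs code: the
registration path rounds the buffer out to page boundaries and marks
those pages so they simply do not exist in a later child, which is why
data that merely shares a page with a registered buffer disappears
after fork().

    /* Minimal sketch of the MADV_DONTFORK mechanism (not libibverbs code).
     * Registration code rounds the buffer out to page boundaries and
     * marks those pages "don't fork"; a child created afterwards has no
     * mapping for them at all -- not even a VMA -- so touching them
     * raises SIGSEGV. */
    #define _DEFAULT_SOURCE
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/mman.h>

    static int mark_registered_region(void *buf, size_t len)
    {
        uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
        uintptr_t start = (uintptr_t)buf & ~(page - 1);                    /* round down */
        uintptr_t end   = ((uintptr_t)buf + len + page - 1) & ~(page - 1); /* round up   */

        /* Everything in [start, end) -- including unrelated data that
         * merely shares the first or last page with 'buf' -- will be
         * absent in the child after fork(). */
        return madvise((void *)start, end - start, MADV_DONTFORK);
    }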
[OMPI devel] Change to default xcast mode [RFC]
For the last several months, we have supported three modes of sending
the xcast messages used to release MPI processes from their various
stage gates:

1. Direct - message sent directly to each process in a serial fashion

2. Linear - message sent serially to the daemon on each node, which
then "fans" it out to the application procs on that node

3. Binomial - message sent via a binomial tree algorithm to the daemon
on each node, which then "fans" it out to the local application procs

To maintain historical consistency, we have defaulted to "direct".
However, this is not the most scalable mode.

We propose to leave all three of these modes in the system, but to
change the default on the OMPI trunk to "linear" so that it will be
tested more thoroughly by the automated test suite.

Please voice any comments and/or objections. Assuming there is
agreement, we will make the switch (solely on the OMPI trunk - this
will not impact the 1.2 series) on June 1.

Thanks
Ralph
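For readers unfamiliar with the binomial fan-out mentioned in mode 3,
the sketch below shows one common way a binomial broadcast tree over
the daemons can be computed (a generic illustration, not the actual
ORTE xcast code): every daemon that already holds the message keeps
forwarding it, so N daemons are covered in roughly log2(N) rounds
instead of N serial sends.

    /* Generic binomial-tree fan-out over daemon ranks (illustration only,
     * not ORTE code).  Each rank that already holds the message keeps
     * forwarding it, so nprocs ranks are reached in ceil(log2(nprocs))
     * rounds. */
    #include <stdio.h>

    static void binomial_fanout(int rank, int nprocs)
    {
        int mask = 1;

        /* Find the round in which this rank receives the message: the
         * lowest set bit of 'rank' (rank 0, the root, never receives). */
        while (mask < nprocs && !(rank & mask))
            mask <<= 1;

        if (rank != 0)
            printf("rank %d receives from rank %d\n", rank, rank - mask);

        /* After receiving, forward to the remaining sub-trees. */
        for (mask >>= 1; mask > 0; mask >>= 1)
            if (rank + mask < nprocs)
                printf("rank %d forwards to rank %d\n", rank, rank + mask);
    }

    int main(void)
    {
        int nprocs = 8;
        for (int rank = 0; rank < nprocs; rank++)
            binomial_fanout(rank, nprocs);
        return 0;
    }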
Re: [OMPI devel] [devel-core] Change to default xcast mode [RFC]
On 5/18/07 2:06 PM, "Andrew Friedley" wrote:
> Why not use the binomial mode? I assume it is faster?

Yes, but it doesn't work right this minute (it should be fixed soon),
and we would prefer to take a small step first. Linear doesn't require
any major code change, while binomial requires more significant
changes, particularly regarding how the orteds handle the buffers for
re-transmission. Hence, binomial represents a higher degree of risk.

We have tested linear on a significant range of environments and we
therefore expect it to work cleanly. Binomial has not been tested very
much at this time. Once we have more fully exercised linear, and have
had time to fix and test binomial on more environments, then we can try
switching the default to binomial.

> What are the MCA params to control this?

oob_xcast_mode = {"direct", "linear"} ("binomial" will currently
generate an error)

> Can you discuss a little bit about the difference in performance we
> might see between the different modes, and why we might use one over
> the other?

Direct is the slowest due to the number of messages involved. Binomial
is the fastest, as it more rapidly propagates the startup message
across all the procs. Linear is a compromise - it gives you better
performance than direct when ppn > 1 (obviously, when ppn = 1, there is
no benefit at all), but not as good as binomial because we send the
messages to each orted independently (instead of via a binomial tree
method).

> Andrew
>
> Ralph H Castain wrote:
>> For the last several months, we have supported three modes of sending
>> the xcast messages used to release MPI processes from their various
>> stage gates:
>>
>> 1. Direct - message sent directly to each process in a serial fashion
>>
>> 2. Linear - message sent serially to the daemon on each node, which
>> then "fans" it out to the application procs on that node
>>
>> 3. Binomial - message sent via a binomial tree algorithm to the daemon
>> on each node, which then "fans" it out to the local application procs
>>
>> To maintain historical consistency, we have defaulted to "direct".
>> However, this is not the most scalable mode.
>>
>> We propose to leave all three of these modes in the system, but to
>> change the default on the OMPI trunk to "linear" so that it will be
>> tested more thoroughly by the automated test suite.
>>
>> Please voice any comments and/or objections. Assuming there is
>> agreement, we will make the switch (solely on the OMPI trunk - this
>> will not impact the 1.2 series) on June 1.
>>
>> Thanks
>> Ralph
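To put rough numbers behind that comparison, here is a
back-of-the-envelope count of the serial sends the HNP itself must
perform in each mode (illustrative arithmetic only, assuming N nodes
with ppn processes per node; these are not measured figures):

    /* Back-of-the-envelope count of serial sends performed by the HNP in
     * each xcast mode (illustrative arithmetic, not measured data).
     * Assumes N nodes with ppn processes per node. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int nodes = 1024, ppn = 4;

        int direct   = nodes * ppn;            /* one send per process      */
        int linear   = nodes;                  /* one send per daemon       */
        int binomial = (int)ceil(log2(nodes)); /* HNP seeds the tree; the
                                                  daemons relay the rest    */

        printf("direct: %d  linear: %d  binomial (HNP only): %d\n",
               direct, linear, binomial);
        return 0;
    }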
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto
Hi Gleb,

Gleb Natapov wrote:
> a new madvise flag was implemented that allows userspace to mark
> certain memory so that it is not copied to a child process. This
> memory is not mapped in the child at all; not even a VMA is created
> for it. In the parent this memory is

Ah, that explains your previous mention of a segfault.

For static registrations, the ones that are the real problem with fork
because of their infinite exposure, it's much simpler to use
MAP_SHARED...

Patrick

--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
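Patrick's MAP_SHARED remark presumably means allocating the statically
registered buffers from a shared mapping, so that after fork() the
parent and the child reference the same physical pages rather than
copy-on-write duplicates. A minimal sketch under that assumption
(illustrative only, not Myricom or Open MPI code):

    /* Sketch of the MAP_SHARED idea for statically registered buffers
     * (illustrative only).  An anonymous shared mapping is inherited by
     * the child as the same physical pages -- no COW -- so the parent's
     * registration stays consistent and the child can still read the
     * data. */
    #define _DEFAULT_SOURCE
    #include <stddef.h>
    #include <sys/mman.h>

    static void *alloc_static_registered_buffer(size_t len)
    {
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        /* The caller would then register [buf, buf + len) with the NIC. */
        return (buf == MAP_FAILED) ? NULL : buf;
    }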