[OMPI devel] [RFC] Send data from the end of a buffer during pipeline protocol
Moving to devel; this question seems worthwhile to push out to the general development community.

I've been coming across an increasing number of customers and other random OMPI users who use system(). So if there's zero impact on performance and it doesn't make the code [more] incredibly horrible [than it already is], I'm in favor of this change.

On May 17, 2007, at 7:00 AM, Gleb Natapov wrote:

> Hi,
>
> I have been thinking about changing the pipeline protocol to send data
> from the end of the message instead of from the middle, as it does now.
> The rationale is better fork() support. When an application forks, the
> child doesn't inherit registered memory, so IB providers educate users
> not to touch, in the child process, buffers that were owned by MPI
> before the fork. The problem is that the granularity of registration is
> a HW page (4K), so the last page of a buffer may also contain other
> application data; the user may be unaware of this and be very surprised
> by a SIGSEGV. If the pipeline protocol sends data from the end of the
> buffer, then the last page of the buffer will not be registered (and
> the first page is never registered, because we send the beginning of
> the buffer eagerly with the rendezvous packet), so this situation is
> avoided. It should have zero impact on performance.
>
> What do you think? How common is it for MPI applications to fork()?
>
> --
>                         Gleb.

--
Jeff Squyres
Cisco Systems
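[To make the page-granularity argument concrete, here is a minimal hypothetical sketch; this is not OMPI code, and `PAGE_SIZE`, the helper name, and the eager/tail lengths are all assumptions. Because registration is rounded to page boundaries, sending the head eagerly and the tail by copy from the end leaves the first and last partial pages unregistered, so data that shares those pages stays touchable in a forked child.]

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL  /* HW registration granularity assumed in the thread */

/* Hypothetical helper: given a user buffer, an eagerly sent head, and a
 * tail sent by copy from the end (the proposal), print the page-aligned
 * region that would actually be registered for RDMA. */
static void pipeline_regions(uintptr_t buf, size_t len,
                             size_t eager_len, size_t tail_len)
{
    uintptr_t reg_start = (buf + eager_len + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    uintptr_t reg_end   = (buf + len - tail_len) & ~(PAGE_SIZE - 1);

    printf("eager head (never registered): [%#lx, %#lx)\n",
           (unsigned long)buf, (unsigned long)(buf + eager_len));
    printf("registered RDMA body         : [%#lx, %#lx)\n",
           (unsigned long)reg_start, (unsigned long)reg_end);
    printf("tail sent by copy            : [%#lx, %#lx)\n",
           (unsigned long)reg_end, (unsigned long)(buf + len));
}

int main(void)
{
    /* A buffer that starts and ends mid-page: neither the first nor the
     * last page gets pinned, so unrelated application data sharing those
     * pages is safe to touch in a forked child. */
    pipeline_regions(0x601234UL, 3 * PAGE_SIZE + 100, 512, 4096);
    return 0;
}
```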
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline protocol
On the other hand, since the MPI standard explicitly says you're not allowed to call fork() or system() during the MPI application, and since the network should really cope with this in some way, if it further complicates the code *at all*, I'm strongly against it. Especially since it won't really solve the problem. For example, with one-sided, I'm not going to go out of my way to send the first and last bit of the buffer just so the user can touch those pages while calling fork(). Also, if I understand the leave_pinned protocol, this still won't really solve anything for the general case -- leave_pinned won't send any data eagerly if the buffer is already pinned, so there are still going to be situations where the user can cause problems. Now we have a situation where sometimes it works, sometimes it doesn't, and we pretend to support fork()/system() in certain cases. Seems like actually fixing the problem the "right way" would be the right path forward...

Brian

On May 17, 2007, at 10:10 AM, Jeff Squyres wrote:

> Moving to devel; this question seems worthwhile to push out to the
> general development community.
>
> I've been coming across an increasing number of customers and other
> random OMPI users who use system(). So if there's zero impact on
> performance and it doesn't make the code [more] incredibly horrible
> [than it already is], I'm in favor of this change.
>
> [... Gleb Natapov's original proposal, quoted in full above ...]
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline protocol
Jeff Squyres wrote:
> Moving to devel; this question seems worthwhile to push out to the
> general development community.
>
> I've been coming across an increasing number of customers and other
> random OMPI users who use system(). So if there's zero impact on
> performance and it doesn't make the code [more] incredibly horrible
> [than it already is], I'm in favor of this change.

I will sound like a broken record, but this is the type of thing that an MPI implementation should not care about, at least not in the (common) protocol layer. That's why the BTL-level abstraction is a bad one: device-specific problems bubble up instead of staying hidden in device-specific code.

On May 17, 2007, at 7:00 AM, Gleb Natapov wrote:
> problem is that granularity of registration is HW page (4K), so last

What about huge pages?

> page of the buffer may contain also other application's data and user
> may be unaware of this and be very surprised by SIGSEGV. If pipeline

How can a process get a segmentation fault by accessing a page mapped in its own address space?

> so this situation will be avoided. It should have zero impact on
> performance. What do you think? How common for MPI applications to
> fork()?

The only safe way to support fork() with pinned pages is to force duplication of the pages at fork time. It makes fork much more expensive, but fork should not be in the critical path of HPC applications anyway. Playing with a registration cache is playing with fire.

My 2 cents.

Patrick

--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline protocol
Brian Barrett wrote:
> On the other hand, since the MPI standard explicitly says you're not
> allowed to call fork() or system() during the MPI application and

Does it? The MPI spec says that you should not access buffers that have been committed to MPI (a pending asynchronous send or receive buffer, for example). It does not care about page boundaries and pinning side effects.

The fork() problem is due to memory registration, aggravated by the registration cache. Memory registration is in itself a hack from the OS point of view, and you already know a lot about the various problems related to the registration cache. The right way to fix the fork problem is to fix the memory registration problem in the OS itself. That's not going to happen anytime soon, so it requires another hack (forcing VM duplication of registered pages at fork time).

Patrick

--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com
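[For context: libibverbs' own fork() workaround, ibv_fork_init(), attacks the problem from the opposite direction. Rather than duplicating pinned pages at fork time, it marks registered regions with madvise(MADV_DONTFORK) so the child never inherits them. A minimal, self-contained sketch of that mechanism on Linux follows; the buffer here is a plain anonymous mapping standing in for a registered region, and no real HCA is involved.]

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t len = 4 * (size_t)sysconf(_SC_PAGESIZE);

    /* Stand-in for a buffer an IB HCA would have pinned; mmap gives us
     * page-aligned memory, as registration requires. */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    memset(buf, 0x55, len);

    /* The libibverbs-style hack: exclude the pinned region from fork()
     * entirely, so copy-on-write can never move the parent's (and thus
     * the hardware's) physical pages out from under a pending transfer. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        perror("madvise(MADV_DONTFORK)");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: this access raises SIGSEGV because the region was not
         * inherited -- the same surprise Gleb describes when unrelated
         * data shares the last page of a registered buffer. */
        printf("child sees: 0x%02x\n", (unsigned char)buf[0]);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    /* Parent: its mapping (and the hardware's view of it) is untouched. */
    printf("parent still sees: 0x%02x\n", (unsigned char)buf[0]);
    return 0;
}
```

[Patrick's suggested hack of eagerly duplicating the pages would have to live in the kernel, since standard userspace APIs offer no way to force a copy of specific pages at fork time; MADV_DONTFORK trades that for a hard failure in the child, which at least fails loudly instead of corrupting data silently.]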
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline protocol
Patrick Geoffray wrote:
> The fork() problem is due to memory registration, aggravated by the
> registration cache. Memory registration is in itself a hack from the
> OS point of view, and you already know a lot about the various
> problems related to the registration cache.

So Gleb is indicating that this is a problem in the pipeline protocol, which does not use a registration cache. I think the registration cache, while increasing the probability of badness after fork(), is not the culprit.

> The right way to fix the fork problem is to fix the memory
> registration problem in the OS itself. That's not going to happen
> anytime soon, so it requires another hack (forcing VM duplication of
> registered pages at fork time).
>
> Patrick
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline protocol
gshipman wrote:
>> The fork() problem is due to memory registration, aggravated by the
>> registration cache. Memory registration is in itself a hack from the
>> OS point of view, and you already know a lot about the various
>> problems related to the registration cache.
>
> So Gleb is indicating that this is a problem in the pipeline protocol,
> which does not use a registration cache. I think the registration
> cache, while increasing the probability of badness after fork(), is
> not the culprit.

Indeed; it makes things worse by extending the vulnerability outside the time frame of an asynchronous communication. Without the registration cache, the bad case is limited to a process that forks while a communication is pending and touches the same pages before they are read or written by the hardware. This is not very likely, because the window of time is very small, but it is still possible. However, it is not limited to the last partial page of the buffer; it can happen for any pinned page.

Patrick

--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com