[OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-17 Thread Jeff Squyres
Moving to devel; this question seems worthwhile to push out to the  
general development community.


I've been coming across an increasing number of customers and other  
random OMPI users who use system().  So if there's zero impact on  
performance and it doesn't make the code [more] incredibly horrible  
[than it already is], I'm in favor of this change.




On May 17, 2007, at 7:00 AM, Gleb Natapov wrote:


Hi,

 I have been thinking about changing the pipeline protocol to send data
from the end of the message instead of from the middle, as it does now.
The rationale behind this is better fork() support. When an application
forks, the child doesn't inherit registered memory, so IB providers
educate users not to touch, in the child process, buffers that were
owned by MPI before the fork. The problem is that the granularity of
registration is a HW page (4K), so the last page of the buffer may also
contain other application data, and the user may be unaware of this and
be very surprised by a SIGSEGV. If the pipeline protocol sends data from
the end of a buffer, then the last page of the buffer will not be
registered (the first page is never registered, because we send the
beginning of the buffer eagerly with the rendezvous packet), so this
situation will be avoided. It should have zero impact on performance.
What do you think? How common is it for MPI applications to fork()?
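
To make the page arithmetic concrete, here is a minimal sketch (not OMPI
code; the address, lengths and the 4K page size are made up purely for
illustration). It only shows which pages would end up registered when the
head is sent eagerly and the tail is sent by copy from the end:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

int main(void)
{
    /* Example user buffer: starts mid-page, spans a few pages. */
    uintptr_t buf   = 0x601010;
    size_t    len   = 3 * PAGE_SIZE;
    size_t    eager = 8192;   /* head sent with the rendezvous packet */
    size_t    tail  = 1024;   /* tail sent by copy, from the end      */

    /* Middle part that goes over RDMA and therefore must be registered. */
    uintptr_t rdma_lo = buf + eager;
    uintptr_t rdma_hi = buf + len - tail;

    /* The HCA pins whole pages, so registration is rounded outwards. */
    uintptr_t reg_lo = rdma_lo & ~(PAGE_SIZE - 1);
    uintptr_t reg_hi = (rdma_hi + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);

    printf("user buffer : [%#lx, %#lx)\n",
           (unsigned long)buf, (unsigned long)(buf + len));
    printf("registered  : [%#lx, %#lx)\n",
           (unsigned long)reg_lo, (unsigned long)reg_hi);

    /* Because the tail is copied instead of RDMAed, the buffer's last
     * page never enters the registration, so unrelated data sharing
     * that page stays safe to touch in a forked child. */
    return 0;
}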

--
Gleb.



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-17 Thread Brian Barrett
On the other hand, since the MPI standard explicitly says you're not
allowed to call fork() or system() during the MPI application, and since
the network should really cope with this in some way, if it further
complicates the code *at all*, I'm strongly against it. Especially since
it won't really solve the problem. For example, with one-sided, I'm not
going to go out of my way to send the first and last bit of the buffer
just so the user can touch those pages while calling fork().


Also, if I understand the leave_pinned protocol, this still won't  
really solve anything for the general case -- leave pinned won't send  
any data eagerly if the buffer is already pinned, so there are still  
going to be situations where the user can cause problems.  Now we  
have a situation where sometimes it works and sometimes it doesn't  
and we pretend to support fork()/system() in certain cases.  Seems  
like actually fixing the problem the "right way" would be the right  
path forward...
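
To make the leave_pinned case above concrete, a small schematic (the
names here are invented for illustration, not OMPI's actual code path):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in: pretend the whole buffer was left pinned by an earlier send. */
static bool registration_cache_hit(const void *buf, size_t len)
{
    (void)buf; (void)len;
    return true;
}

static void rdma_whole_buffer(size_t len)
{
    printf("cache hit: RDMA all %zu bytes, every page stays pinned\n", len);
}

static void pipeline_copy_edges(size_t len)
{
    printf("cache miss: copy head/tail, register only the middle of %zu bytes\n",
           len);
}

/* With leave_pinned and a cache hit, nothing is sent by copy, so the
 * "never register the first/last page" trick never kicks in. */
static void send_large(const void *buf, size_t len)
{
    if (registration_cache_hit(buf, len))
        rdma_whole_buffer(len);
    else
        pipeline_copy_edges(len);
}

int main(void)
{
    static char buf[1 << 20];   /* 1 MiB message */
    send_large(buf, sizeof buf);
    return 0;
}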


Brian





Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-17 Thread Patrick Geoffray

Jeff Squyres wrote:
Moving to devel; this question seems worthwhile to push out to the  
general development community.


I've been coming across an increasing number of customers and other  
random OMPI users who use system().  So if there's zero impact on  
performance and it doesn't make the code [more] incredibly horrible  
[than it already is], I'm in favor of this change.


I will sound like a broken record, but this is the type of thing that an
MPI implementation should not care about, at least not in the (common)
protocol layer. That's why the BTL-level abstraction is a bad one:
device-specific problems bubble up instead of staying hidden in
device-specific code.



On May 17, 2007, at 7:00 AM, Gleb Natapov wrote:



The problem is that the granularity of registration is a HW page (4K)


What about huge pages?


the last page of the buffer may also contain other application data, and
the user may be unaware of this and be very surprised by a SIGSEGV


How can a process get a segmentation fault by accessing a page mapped in
its own address space?



so this situation will be avoided. It should have zero impact on
performance. What do you think? How common is it for MPI applications to
fork()?


The only safe way to support fork() with pinned pages is to force the
duplication of those pages at fork time. It makes fork much more
expensive, but fork should not be in the critical path of HPC
applications anyway.
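
For reference, the knob libibverbs exposes today makes the opposite
trade-off from that copy-at-fork hack: ibv_fork_init() marks registered
regions with madvise(MADV_DONTFORK) so the child simply does not inherit
them rather than duplicating them. A minimal sketch, assuming libibverbs
is installed:

#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>

int main(void)
{
    /* Must be called before any verbs resources (contexts, MRs) exist. */
    int rc = ibv_fork_init();
    if (rc != 0) {
        fprintf(stderr, "ibv_fork_init: %s\n", strerror(rc));
        return 1;
    }
    printf("fork support enabled: registered pages are not inherited by children\n");
    /* ... ibv_open_device(), ibv_reg_mr(), fork(), etc. would follow ... */
    return 0;
}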


Playing with registration cache is playing with fire.

My 2 cents.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com


Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-17 Thread Patrick Geoffray

Brian Barrett wrote:
On the other hand, since the MPI standard explicitly says you're not  
allowed to call fork() or system() during the MPI application and  


Does it? The MPI spec says that you should not access buffers that have
been committed to MPI (a buffer with a pending asynchronous send or recv,
for example). It does not care about page boundaries or pinning side
effects.


The fork() problem is due to memory registration, aggravated by the
registration cache. Memory registration is in itself a hack from the OS
point of view, and you already know a lot about the various problems
related to registration caches.


The right way to fix the fork problem is to fix the memory registration 
problem in the OS itself. It's not going to happen anytime soon, so it 
requires another hack (forcing VM duplication of registered pages at 
fork time).


Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com


Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-17 Thread gshipman


The fork() problem is due to memory registration, aggravated by the
registration cache. Memory registration is in itself a hack from the OS
point of view, and you already know a lot about the various problems
related to registration caches.

So Gleb is indicating that this is a problem in the pipeline protocol,
which does not use a registration cache. I think the registration cache,
while increasing the probability of badness after fork, is not the
culprit.








Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-17 Thread Patrick Geoffray

gshipman wrote:

The fork() problem is due to memory registration, aggravated by the
registration cache. Memory registration is in itself a hack from the OS
point of view, and you already know a lot about the various problems
related to registration caches.

So Gleb is indicating that this is a problem in the pipeline protocol,
which does not use a registration cache. I think the registration cache,
while increasing the probability of badness after fork, is not the
culprit.


Indeed, it makes things worse by extending the vulnerability outside the
time frame of an asynchronous communication. Without the registration
cache, the bad case is limited to a process that forks while a
communication is pending and touches the same pages before they are read
or written by the hardware. This is not very likely, because the window
of time is very small, but it is still possible. However, it is not
limited to the last partial page of the buffer; it can happen for any
pinned page.
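
A stripped-down sketch of that window (no NIC involved: pin_for_dma()
below is only a stand-in for ibv_reg_mr(), so the comments describe what
would go wrong rather than demonstrating it):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define PAGE 4096

/* Stand-in for registration: a real BTL would call ibv_reg_mr() here,
 * pinning the physical pages backing [buf, buf+len) for DMA. */
static void pin_for_dma(void *buf, size_t len) { (void)buf; (void)len; }

int main(void)
{
    char *buf = mmap(NULL, PAGE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memset(buf, 'A', PAGE);
    pin_for_dma(buf, PAGE);     /* NIC now targets this *physical* page */

    pid_t pid = fork();         /* the page becomes copy-on-write       */
    if (pid == 0) {
        /* Child touches the buffer.  The writer of a COW page gets a
         * fresh copy; if instead the PARENT writes first, the parent's
         * virtual address moves to a new page while the NIC keeps
         * DMAing to the old one -- the pending receive is lost. */
        buf[0] = 'B';
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent still sees '%c'\n", buf[0]);   /* prints 'A' */
    return 0;
}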


Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com