Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-18 Thread Gleb Natapov
On Thu, May 17, 2007 at 10:20:35AM -0600, Brian Barrett wrote:
> On the other hand, since the MPI standard explicitly says you're not  
> allowed to call fork() or system() during the MPI application and  
> since the network should really cope with this in some way, if it
> further complicates the code *at all*, I'm strongly against it.   
> Especially since it won't really solve the problem.  For example,  
> with one-sided, I'm not going to go out of my way to send the first  
> and last bit of the buffer so the user can touch those pages while  
> calling fork.
> 
> Also, if I understand the leave_pinned protocol, this still won't  
> really solve anything for the general case -- leave pinned won't send  
> any data eagerly if the buffer is already pinned, so there are still  
> going to be situations where the user can cause problems.  Now we  
> have a situation where sometimes it works and sometimes it doesn't  
> and we pretend to support fork()/system() in certain cases.  Seems  
> like actually fixing the problem the "right way" would be the right  
> path forward...

This will not solve all the problems; it just slightly decreases the chance
of the program getting a SIGSEGV. We are not going to pretend that we
support fork() or system(). Obviously this change will not help the
one-sided, leave_pinned or leave_pinned_pipeline cases. As for the
"complicating the code" issue: I am working on solving a deadlock in the
pipeline protocol, and for that I need the capability to send any part of a
message by copy in/out. The change I propose will be trivial to do on top of
that. The code will become more complex because of the deadlock issues, not
because of the change we are discussing now :)

> 
> Brian
> 
> On May 17, 2007, at 10:10 AM, Jeff Squyres wrote:
> 
> > Moving to devel; this question seems worthwhile to push out to the
> > general development community.
> >
> > I've been coming across an increasing number of customers and other
> > random OMPI users who use system().  So if there's zero impact on
> > performance and it doesn't make the code [more] incredibly horrible
> > [than it already is], I'm in favor of this change.
> >
> >
> >
> > On May 17, 2007, at 7:00 AM, Gleb Natapov wrote:
> >
> >> Hi,
> >>
> >>  I have been thinking about changing the pipeline protocol to send
> >> data from the end of the message instead of the middle, as it does now.
> >> The rationale behind this is better fork() support. When an application
> >> forks, the child doesn't inherit registered memory, so IB providers
> >> educate users not to touch, in the child process, buffers that were
> >> owned by MPI before the fork. The problem is that the granularity of
> >> registration is a HW page (4K), so the last page of the buffer may also
> >> contain other application data; the user may be unaware of this and be
> >> very surprised by a SIGSEGV. If the pipeline protocol sends data from
> >> the end of a buffer, then the last page of the buffer will not be
> >> registered (and the first page is never registered because we send the
> >> beginning of the buffer eagerly with the rendezvous packet), so this
> >> situation will be avoided. It should have zero impact on performance.
> >> What do you think? How common is it for MPI applications to fork()?
> >>
> >> --
> >>Gleb.
> >
> >
> > -- 
> > Jeff Squyres
> > Cisco Systems
> >

--
Gleb.


Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-18 Thread Gleb Natapov
On Thu, May 17, 2007 at 12:30:51PM -0400, Patrick Geoffray wrote:
> Jeff Squyres wrote:
> > Moving to devel; this question seems worthwhile to push out to the  
> > general development community.
> > 
> > I've been coming across an increasing number of customers and other  
> > random OMPI users who use system().  So if there's zero impact on  
> > performance and it doesn't make the code [more] incredibly horrible  
> > [than it already is], I'm in favor of this change.
> 
> I will sound like a broken record, but this is the type of thing that an
> MPI implementation should not care about, at least not in the (common)
> protocol layer. That's why the BTL-level abstraction is a bad one:
> device-specific problems bubble up instead of staying hidden in
> device-specific code.
I am glad I provided one more point in favor of your argument :)

> 
> > On May 17, 2007, at 7:00 AM, Gleb Natapov wrote:
> 
> >> problem is that granularity of registration is HW page (4K), so last
> 
> What about huge pages ?
I'll say it again: I am not trying to solve all the problems of interconnects
that were designed by people who ignored 30 or so years of OS design
evolution.

Huge page usage is not transparent in Linux. If a programmer decides to use
it, he should understand the consequences.
> 
> >> page of the buffer may contain also other application's data and user
> >> may be unaware of this and be very surprised by SIGSEGV. If pipeline
> 
> How can a process get a segmentation fault by accessing a page mapped in 
>   its own address space?
In the child process, VMAs that were registered in the parent process are no
longer mapped.

> 
> >> so this situation will be avoided. It should have zero impact on
> >> performance. What do you think? How common for MPI applications to
> >> fork()?
> 
> The only safe way to support fork() with pinned pages is to force the
> duplication of pages at fork time. It makes fork much more expensive, 
> but fork should not be in the critical path of HPC applications anyway.
> 
It also increases the chance of fork() failure. Otherwise I agree with you. I
even started writing a patch once to duplicate only the first and last pages
of a pinned region. The chances of such a patch being accepted into Linux are
less than zero, though.


> Playing with registration cache is playing with fire.
The change I propose will not solve any of these problems if the registration
cache is in use.

--
Gleb.


Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-18 Thread Gleb Natapov
On Thu, May 17, 2007 at 02:35:02PM -0400, Patrick Geoffray wrote:
> Brian Barrett wrote:
> > On the other hand, since the MPI standard explicitly says you're not  
> > allowed to call fork() or system() during the MPI application and  
> 
> Does it? The MPI spec says that you should not access buffers that have
> been committed to MPI (pending asynchronous send or recv buffer for 
> example). It does not care about page boundary and pinning side effects.
That is exactly what I am trying to achieve with the proposed change. The
child will not be able to touch memory that was committed to MPI at the time
of fork(), but all other memory will be safe. This is not the case currently
with IB (even when the registration cache is _not_ in use).
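Here is a minimal sketch of the idea (illustrative only, not Open MPI code;
the buffer address, length and 4K page size are made-up values): if the head
of the buffer goes out eagerly with the rendezvous packet and the tail is
sent by copy in/out from the end, only the interior, fully-owned pages ever
need to be registered, so a forked child can still touch whatever else shares
the first and last pages.

```c
/* Illustrative sketch, not Open MPI code: which bytes of a send buffer
 * would ever need pinning under the proposed scheme.  Buffer address,
 * length and page size are made-up values. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL            /* HW registration granularity */

int main(void)
{
    uintptr_t buf = 0x601234;       /* hypothetical, not page aligned */
    size_t    len = 100000;         /* hypothetical message length    */

    /* Head [buf, first_full) is sent eagerly with the RNDV packet and
     * tail [last_full, buf+len) is sent by copy from the end, so only
     * the full pages in between are registered with the HCA. */
    uintptr_t first_full = (buf + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
    uintptr_t last_full  = (buf + len) & ~(PAGE_SIZE - 1);

    printf("copy head : %lu bytes\n", (unsigned long)(first_full - buf));
    printf("register  : [%#lx, %#lx)\n",
           (unsigned long)first_full, (unsigned long)last_full);
    printf("copy tail : %lu bytes\n",
           (unsigned long)(buf + len - last_full));
    return 0;
}
```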

> 
> The fork() problem is due to memory registration aggravated by 
> registration cache. Memory registration in itself is a hack from the OS 
> point of view, and you already know a lot about the various problems 
> related to registration cache.
> 
> The right way to fix the fork problem is to fix the memory registration 
> problem in the OS itself. It's not going to happen anytime soon, so it 
> requires another hack (forcing VM duplication of registered pages at 
> fork time).
> 

--
Gleb.


Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-18 Thread Gleb Natapov
On Thu, May 17, 2007 at 02:57:22PM -0400, Patrick Geoffray wrote:
> gshipman wrote:
> >> The fork() problem is due to memory registration aggravated by
> >> registration cache. Memory registration in itself is a hack from  
> >> the OS
> >> point of view, and you already know a lot about the various problems
> >> related to registration cache.
> >>
> > So Gleb is indicating that this is a problem in the pipeline protocol  
> > which does not use a registration cache. I think the registration  
> > cache, while increasing the probability of badness after fork, is not  
> > the culprit.
> 
> Indeed, it makes things worse by extending the vulnerability outside the 
> time frame of an asynchronous communication. Without the registration 
> cache, the bad case is limited to a process that forks while a com is 
> pending and touches the same pages before they are read/written by the 
> hardware. This is not very likely because the window of time is very 
> small, but still possible. However, it is not limited to the last 
> partial page of the buffer, it can happen for any pinned page.
> 
Now I see that you don't fully understand all of the IB ugliness, so let me
explain. In IB, the QP and CQ also use registered memory that is directly
written/read by the hardware (to signal a completion or to fetch the next
work request). After fork() the parent continues to use IB, of course, and
most definitely touches QP/CQ memory, and at that very moment everything
breaks. So to overcome this problem (and to allow an IB program to fork() at
all) a new madvise flag was implemented that allows userspace to mark certain
memory as not to be copied to a child process. This memory is not mapped in
the child at all; not even a VMA is created for it. In the parent this memory
is not marked COW. All memory that is registered by IB is marked in this way.
So the problem is that if a non-aligned buffer is committed to MPI, it may
share a page with data that the child may want to use, but that data will not
be present in the child.
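A tiny self-contained example of the failure mode (assumptions: Linux with
MADV_DONTFORK, and plain madvise() standing in for what the IB registration
path does; this is not libibverbs or Open MPI code):

```c
/* Illustrative only -- not libibverbs or Open MPI code.  Shows how the
 * MADV_DONTFORK mark placed on whole pages by IB registration can make
 * unrelated data on a shared page disappear in the child. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    /* One page holding both a "send buffer" and unrelated user data. */
    char *page = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *sendbuf  = page;            /* first half: committed to MPI   */
    char *userdata = page + psz / 2;  /* second half: user's other data */
    strcpy(userdata, "still needed after fork");

    /* Registration pins whole pages, so the DONTFORK mark covers the
     * entire page, including the unrelated second half. */
    madvise(page, psz, MADV_DONTFORK);

    if (fork() == 0) {
        /* The child has no VMA for this page at all: SIGSEGV here. */
        printf("%s\n", userdata);
        _exit(0);
    }
    wait(NULL);
    (void)sendbuf;
    return 0;
}
```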

--
Gleb.


[OMPI devel] Change to default xcast mode [RFC]

2007-05-18 Thread Ralph H Castain
For the last several months, we have supported three modes of sending the
xcast messages used to release MPI processes from their various stage gates:

1. Direct - message sent directly to each process in a serial fashion

2. Linear - message sent serially to the daemon on each node, which then
"fans" it out to the application procs on that node

3. Binomial - message sent via binomial tree algorithm to the daemon on each
node, which then "fans" it out to the local application procs

To maintain historical consistency, we have defaulted to "direct". However,
this is not the most scalable mode.

We propose to leave all three of these modes in the system, but to change
the default on the OMPI trunk to "linear" so that it will be tested more
thoroughly by the automated test suite.

Please voice any comments and/or objections. Assuming there is agreement, we
will make the switch (solely on the OMPI trunk - this will not impact the
1.2 series) on June 1.

Thanks
Ralph




Re: [OMPI devel] [devel-core] Change to default xcast mode [RFC]

2007-05-18 Thread Ralph H Castain



On 5/18/07 2:06 PM, "Andrew Friedley"  wrote:

> Why not use the binomial mode?  I assume it is faster?

Yes, but it doesn't work right this minute (it should be fixed soon), and we
would prefer to take a small step first. Linear doesn't require any major
code change, while binomial requires more significant changes, particularly
regarding how the orteds handle the buffers for re-transmission. Hence,
binomial represents a higher degree of risk.

We have tested linear on a significant range of environments and we
therefore expect to see it work cleanly. Binomial has not been tested very
much at this time. Once we have more fully exercised linear, and have had
time to fix and test binomial on more environments, then we can try
switching the default to binomial.

> 
> What are the MCA params to control this?

oob_xcast_mode = {"direct", "linear"} ("binomial" will currently generate an
error)

> 
> Can you discuss a little bit about the difference in performance we
> might see between the different modes, and why we might use one over the
> other?

Direct is the slowest due to the number of messages involved. Binomial is
the fastest as it more rapidly propagates the startup message across all the
procs.

Linear is a compromise - it gives you better performance than direct when
ppn > 1 (obviously, when ppn=1, there is no benefit at all), but not as good
as binomial because we send the messages to each orted independently
(instead of via a binomial tree method).
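For a rough feel of why the modes scale differently, here is a toy model (not
taken from the ORTE source; the node and ppn counts are made up) that counts
serial sends for direct and linear and simulates the binomial relay among
daemons:

```c
/* Toy scaling model for the three xcast modes -- illustrative only,
 * not ORTE code; node and ppn counts are made up. */
#include <stdio.h>

int main(void)
{
    int nodes = 64, ppn = 4;
    int procs = nodes * ppn;

    /* direct: the HNP sends one message per MPI process, serially. */
    printf("direct  : %d serial sends from the HNP\n", procs);

    /* linear: one serial send per daemon; each daemon then fans the
     * message out to its own local procs. */
    printf("linear  : %d serial sends from the HNP + local fan-out\n",
           nodes);

    /* binomial: daemons relay the message along a binomial tree, so
     * the number of daemons holding it doubles every round. */
    int have = 1, rounds = 0;
    while (have < nodes) {
        have *= 2;
        rounds++;
    }
    printf("binomial: %d relay rounds among daemons + local fan-out\n",
           rounds);
    return 0;
}
```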


> 
> Andrew
> 
> Ralph H Castain wrote:
>> For the last several months, we have supported three modes of sending the
>> xcast messages used to release MPI processes from their various stage gates:
>> 
>> 1. Direct - message sent directly to each process in a serial fashion
>> 
>> 2. Linear - message sent serially to the daemon on each node, which then
>> "fans" it out to the application procs on that node
>> 
>> 3. Binomial - message sent via binomial tree algorithm to the daemon on each
>> node, which then "fans" it out to the local application procs
>> 
>> To maintain historical consistency, we have defaulted to "direct". However,
>> this is not the most scalable mode.
>> 
>> We propose to leave all three of these modes in the system, but to change
>> the default on the OMPI trunk to "linear" so that it will be tested more
>> thoroughly by the automated test suite.
>> 
>> Please voice any comments and/or objections. Assuming there is agreement, we
>> will make the switch (solely on the OMPI trunk - this will not impact the
>> 1.2 series) on June 1.
>> 
>> Thanks
>> Ralph
>> 
>> 




Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto

2007-05-18 Thread Patrick Geoffray

Hi Gleb,

Gleb Natapov wrote:
> a new madvise flag was implemented that allows userspace to mark certain
> memory as not to be copied to a child process. This memory is not mapped in
> the child at all; not even a VMA is created for it. In the parent this memory is


Ah, that explains your previous mention of a segfault. For static
registrations, the ones that are the real problem with fork because of 
the infinite exposure, it's much simpler to use MAP_SHARED...


Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com