Re: [OMPI devel] collective problems

2007-11-07 Thread Shipman, Galen M.
The lengths we go to avoid progress :-)




On 11/7/07 10:19 PM, "Richard Graham"  wrote:

> The real problem, as you and others have pointed out, is the lack of
> predictable time slices for the progress engine to do its work when relying
> on the ULP to make calls into the library...
> 
> Rich
> 
> 
> On 11/8/07 12:07 AM, "Brian Barrett"  wrote:
> 
>> As it stands today, the problem is that we can inject things into the
>> BTL successfully that are not injected into the NIC (due to software
>> flow control).  Once a message is injected into the BTL, the PML marks
>> completion on the MPI request.  If it was a blocking send that got
>> marked as complete, but the message isn't injected into the NIC/NIC
>> library, and the user doesn't re-enter the MPI library for a
>> considerable amount of time, then we have a problem.
>> 
>> Personally, I'd rather just not mark MPI completion until a local
>> completion callback from the BTL.  But others don't like that idea, so
>> we came up with a way for back pressure from the BTL to say "it's not
>> on the wire yet".  This is more complicated than just not marking MPI
>> completion early, but why would we do something that helps real apps
>> at the expense of benchmarks?  That would just be silly!
>> 
>> Brian
>> 
>> On Nov 7, 2007, at 7:56 PM, Richard Graham wrote:
>> 
>>> Does this mean that we don't have a queue to store btl level
>>> descriptors that are only partially complete?  Do we do an all or nothing
>>> with respect to btl level requests at this stage?
>>> 
>>> Seems to me like we want to mark things complete at the MPI level
>>> ASAP, and that this proposal is not to do that -- is this correct?
>>> 
>>> Rich
>>> 
>>> 
>>> On 11/7/07 11:26 PM, "Jeff Squyres"  wrote:
>>> 
 On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote:
 
>> Remember that this is all in the context of Galen's proposal for
>> btl_send() to be able to return NOT_ON_WIRE -- meaning that the
 send
>> was successful, but it has not yet been sent (e.g., openib BTL
>> buffered it because it ran out of credits).
> 
> Sorry if I'm missing something obvious, but why does the PML have to be
> aware of the flow control situation of the BTL? If the BTL cannot send
> something right away for any reason, it should be the responsibility of
> the BTL to buffer it and to progress on it later.
 
 
 That's currently the way it is.  But the BTL currently only has the
 option to say two things:
 
 1. "ok, done!" -- then the PML will think that the request is
 complete
 2. "doh -- error!" -- then the PML thinks that Something Bad
 Happened(tm)
 
 What we really need is for the BTL to have a third option:
 
 3. "not done yet!"
 
 So that the PML knows that the request is not yet done, but will
 allow
 other things to progress while we're waiting for it to complete.
 Without this, the openib BTL currently replies "ok, done!", even when
 it has only buffered a message (rather than actually sending it out).
 This optimization works great (yeah, I know...) except for apps that
 don't dip into the MPI library frequently.  :-\
 
 --
 Jeff Squyres
 Cisco Systems
 
 ___
 devel mailing list
 de...@open-mpi.org
 http://www.open-mpi.org/mailman/listinfo.cgi/devel
 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] collective problems

2007-11-07 Thread Brian Barrett
As it stands today, the problem is that we can inject things into the  
BTL successfully that are not injected into the NIC (due to software  
flow control).  Once a message is injected into the BTL, the PML marks  
completion on the MPI request.  If it was a blocking send that got  
marked as complete, but the message isn't injected into the NIC/NIC  
library, and the user doesn't re-enter the MPI library for a  
considerable amount of time, then we have a problem.


Personally, I'd rather just not mark MPI completion until a local  
completion callback from the BTL.  But others don't like that idea, so  
we came up with a way for back pressure from the BTL to say "it's not  
on the wire yet".  This is more complicated than just not marking MPI  
completion early, but why would we do something that helps real apps  
at the expense of benchmarks?  That would just be silly!
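
For illustration, a minimal sketch of that deferred-completion path -- all
names are simplified stand-ins (progress_engine() standing in for
opal_progress()), not the actual ob1/openib code:

struct request { volatile int complete; };

extern int  btl_send(struct request *req);   /* may only buffer the fragment */
extern void progress_engine(void);           /* stand-in for opal_progress() */

/* registered with the BTL; fires once the data has really left the host */
void pml_local_completion_cb(struct request *req)
{
    req->complete = 1;                  /* only now may a blocking send return */
}

int pml_blocking_send(struct request *req)
{
    int rc = btl_send(req);
    if (rc != 0) {
        return rc;
    }
    /* no early completion: wait for the callback, progressing as we go */
    while (!req->complete) {
        progress_engine();
    }
    return 0;
}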


Brian

On Nov 7, 2007, at 7:56 PM, Richard Graham wrote:

Does this mean that we don't have a queue to store btl level
descriptors that are only partially complete?  Do we do an all or nothing
with respect to btl level requests at this stage?

Seems to me like we want to mark things complete at the MPI level
ASAP, and that this proposal is not to do that – is this correct?

Rich


On 11/7/07 11:26 PM, "Jeff Squyres"  wrote:


On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote:

>> Remember that this is all in the context of Galen's proposal for
>> btl_send() to be able to return NOT_ON_WIRE -- meaning that the  
send

>> was successful, but it has not yet been sent (e.g., openib BTL
>> buffered it because it ran out of credits).
>
> Sorry if I'm missing something obvious, but why does the PML have to be
> aware of the flow control situation of the BTL? If the BTL cannot send
> something right away for any reason, it should be the responsibility of
> the BTL to buffer it and to progress on it later.


That's currently the way it is.  But the BTL currently only has the
option to say two things:

1. "ok, done!" -- then the PML will think that the request is  
complete

2. "doh -- error!" -- then the PML thinks that Something Bad
Happened(tm)

What we really need is for the BTL to have a third option:

3. "not done yet!"

So that the PML knows that the request is not yet done, but will  
allow

other things to progress while we're waiting for it to complete.
Without this, the openib BTL currently replies "ok, done!", even when
it has only buffered a message (rather than actually sending it out).
This optimization works great (yeah, I know...) except for apps that
don't dip into the MPI library frequently.  :-\

--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





Re: [OMPI devel] collective problems

2007-11-07 Thread Jeff Squyres

On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote:


Remember that this is all in the context of Galen's proposal for
btl_send() to be able to return NOT_ON_WIRE -- meaning that the send
was successful, but it has not yet been sent (e.g., openib BTL
buffered it because it ran out of credits).


Sorry if I'm missing something obvious, but why does the PML have to be
aware of the flow control situation of the BTL? If the BTL cannot send
something right away for any reason, it should be the responsibility of
the BTL to buffer it and to progress on it later.



That's currently the way it is.  But the BTL currently only has the  
option to say two things:


1. "ok, done!" -- then the PML will think that the request is complete
2. "doh -- error!" -- then the PML thinks that Something Bad  
Happened(tm)


What we really need is for the BTL to have a third option:

3. "not done yet!"

So that the PML knows that the request is not yet done, but will allow  
other things to progress while we're waiting for it to complete.   
Without this, the openib BTL currently replies "ok, done!", even when  
it has only buffered a message (rather than actually sending it out).   
This optimization works great (yeah, I know...) except for apps that  
don't dip into the MPI library frequently.  :-\
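
To make the third option concrete, here is a rough sketch of how a PML send
path could branch on it.  OMPI_NOT_ON_WIRE is the constant proposed in this
thread (it does not exist yet), and the other names are simplified stand-ins
rather than the real ob1 code:

#include <stdbool.h>

#define OMPI_SUCCESS      0
#define OMPI_NOT_ON_WIRE  1       /* proposed: buffered by the BTL, not sent yet */

struct send_request {
    bool mpi_complete;            /* what MPI_Test / MPI_Wait look at */
};

extern int btl_send(struct send_request *req);

int pml_start_send(struct send_request *req)
{
    int rc = btl_send(req);

    if (OMPI_SUCCESS == rc) {
        req->mpi_complete = true; /* really handed off: safe to mark completion */
        return OMPI_SUCCESS;
    }
    if (OMPI_NOT_ON_WIRE == rc) {
        /* flow-controlled: leave the request pending; the progress engine will
         * mark MPI completion once the BTL reports local completion */
        return OMPI_SUCCESS;
    }
    return rc;                    /* a real error */
}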


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] collective problems

2007-11-07 Thread Patrick Geoffray

Jeff Squyres wrote:

This is not a problem in the current code base.

Remember that this is all in the context of Galen's proposal for  
btl_send() to be able to return NOT_ON_WIRE -- meaning that the send  
was successful, but it has not yet been sent (e.g., openib BTL  
buffered it because it ran out of credits).


Sorry if I'm missing something obvious, but why does the PML have to be aware
of the flow control situation of the BTL? If the BTL cannot send
something right away for any reason, it should be the responsibility of
the BTL to buffer it and to progress on it later.


Patrick


Re: [OMPI devel] Multiworld MCA parameter values broken

2007-11-07 Thread Tim Prins
I'm curious what changed to make this a problem. How were we passing mca params
from the base to the app before, and why did it change?

I think that options 1 & 2 below are no good, since we, in general, allow 
string mca params to have spaces (as far as I understand it). So a more 
general approach is needed. 

Tim

On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote:
> Sorry for delay - wasn't ignoring the issue.
>
> There are several fixes to this problem - ranging in order from least to
> most work:
>
> 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param.
> It won't affect anything on the backend because the daemon/procs don't use
> ssh.
>
> 2. include "pls_rsh_agent" in the array of mca params not to be passed to
> the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the
> orte_pls_base_orted_append_basic_args function. This would fix the specific
> problem cited here, but I admit that listing every such param by name would
> get tedious.
>
> 3. we could easily detect that a "problem" character was in the mca param
> value when we add it to the orted's argv, and then put "" around it. The
> problem, however, is that the mca param parser on the far end doesn't
> remove those "" from the resulting string. At least, I spent over a day
> fighting with a problem only to discover that was happening. Could be an
> error in the way I was doing things, or could be a real characteristic of
> the parser. Anyway, we would have to ensure that the parser removes any
> surrounding "" before passing along the param value or this won't work.
>
> Ralph
>
> On 11/5/07 12:10 PM, "Tim Prins"  wrote:
> > Hi,
> >
> > Commit 16364 broke things when using multiword mca param values. For
> > instance:
> >
> > mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent
> > "ssh -Y" xterm
> >
> > Will crash and burn, because the value "ssh -Y" is being stored into the
> > argv orted_cmd_line in orterun.c:1506. This is then added to the launch
> > command for the orted:
> >
> > /usr/bin/ssh -Y odin004  PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ;
> > export PATH ;
> > LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ;
> > export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug
> > --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename
> > odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872
> > --nsreplica
> > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0
> >:4090 8"
> > --gprreplica
> > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0
> >:4090 8"
> > -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca
> > mca_base_param_file_path
> > /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/
> >examp les
> > -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples
> >
> > Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So
> > the quotes have been lost, as we die a horrible death.
> >
> > So we need to add the quotes back in somehow, or pass these options
> > differently. I'm not sure what the best way to fix this.
> >
> > Thanks,
> >
> > Tim




Re: [OMPI devel] collective problems

2007-11-07 Thread Jeff Squyres

This is not a problem in the current code base.

Remember that this is all in the context of Galen's proposal for  
btl_send() to be able to return NOT_ON_WIRE -- meaning that the send  
was successful, but it has not yet been sent (e.g., openib BTL  
buffered it because it ran out of credits).


Read these two messages again to get the context:

http://www.open-mpi.org/community/lists/devel/2007/10/2486.php
http://www.open-mpi.org/community/lists/devel/2007/10/2487.php

Gleb describes the recursive problem (paired with the concept of  
NOT_ON_WIRE) nicely in his post.


Make sense?



On Nov 7, 2007, at 1:16 PM, George Bosilca wrote:



On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote:


The same callback is called in both cases. In the case that you
described, the callback is called just a little bit deeper into the
recursion, when in the "normal case" it will get called from the
first level of the recursion. Or maybe I miss something here ...


Right -- it's not the callback that is the problem.  It's when the
recursion is unwound and further up the stack you now have a stale
request.


That's exactly the point that I fail to see. If the request is freed
in the PML callback, then it should get released in both cases, and
therefore lead to problems all the time. Which, obviously, is not
true when we do not have this deep recursion thing going on.


Moreover, the request management is based on reference counts. The
PML level has one ref count and the MPI level has another one. In
fact, we cannot release a request until we explicitly call
ompi_request_free on it. The place where this call happens is
different between the blocking and non-blocking calls. In the
non-blocking case ompi_request_free gets called from the *_test
(*_wait) functions, while in the blocking case it gets called directly
from the MPI_Send function.


Let me summarize: a request cannot reach a stale state without a
call to ompi_request_free. This function is never called directly
from the PML level. Therefore, the recursion depth should not have
any impact on the state of the request!


Is there a simple test case I can run in order to trigger this  
strange behavior ?


 Thanks,
   george.







george.


This is *only* a problem for requests that are involved from the
current top-level MPI call.  Requests from prior calls to MPI
functions
(e.g., a request from a prior call to MPI_ISEND) are ok because a)
we've already done the Right Things to ensure the safety of that
request, and b) that request is not on the recursive stack anywhere
to
become stale as the recursion unwinds.

Right?

If so, Galen proposes the following:

1. in conjunction with the NOT_ON_WIRE proposal...

2. make a new PML request flag DONT_FREE_ME (or some better
name :-) ).

3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more
specifically, the top of the PML calls for blocking send/receive)
right when the request is allocated (i.e., before calling
btl_send()).

4. when the PML is called for completion on this request, it will  
do
all the stuff that it needs to effect completion -- but then it  
will

see the DONT_FREE_ME flag and not actually free the request.
Obviously, if DONT_FREE_ME is *not* set, then the PML does what it
does today: it frees the request.

5. the top-level PML call will eventually complete:
5a. For blocking PML calls (e.g., corresponding to MPI_SEND and
MPI_RECV), the request can be unconditionally freed.
5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND),
only free the request if it was completed.

Note that with this scheme, it becomes irrelevant as to whether the
PML completion call is invoked on the first descent into the BTL or
recursively via opal_progress.

How does that sound?

If that all works, it might be beneficial to put this back to the  
1.2
branch because there are definitely apps that would benefit from  
it.




On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote:


So this problem goes WAY back..

The problem here is that the PML marks MPI completion just  
prior to

calling
btl_send and then returns to the user. This wouldn't be a problem
if the BTL
then did something, but in the case of OpenIB this fragment may  
not

actually
be on the wire (the joys of user level flow control).

One solution that we proposed was to allow btl_send to return
either
OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow  
the

PML to
not mark MPI completion of the fragment and then MPI_WAITALL and
others will
do their job properly.

I even implemented this once, but there is a problem. Currently we
mark
request as completed on MPI level and then do btl_send(). Whenever
IB completion
will happen the request will be marked as complete on PML level  
and

freed. The fix requires to change the order like this: Call
btl_send(),
check return value from BTL and mark request complete as  
necessary.

The
problem is that because we allow BTL to call opal_progress()
internally 

Re: [OMPI devel] collective problems

2007-11-07 Thread George Bosilca


On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote:


The same callback is called in both cases. In the case that you
described, the callback is called just a little bit deeper into the
recursion, when in the "normal case" it will get called from the
first level of the recursion. Or maybe I miss something here ...


Right -- it's not the callback that is the problem.  It's when the
recursion is unwound and further up the stack you now have a stale
request.


That's exactly the point that I fail to see. If the request is freed
in the PML callback, then it should get released in both cases, and
therefore lead to problems all the time. Which, obviously, is not true
when we do not have this deep recursion thing going on.


Moreover, the request management is based on reference counts. The
PML level has one ref count and the MPI level has another one. In
fact, we cannot release a request until we explicitly call
ompi_request_free on it. The place where this call happens is
different between the blocking and non-blocking calls. In the
non-blocking case ompi_request_free gets called from the *_test
(*_wait) functions, while in the blocking case it gets called directly
from the MPI_Send function.


Let me summarize: a request cannot reach a stale state without a call
to ompi_request_free. This function is never called directly from the
PML level. Therefore, the recursion depth should not have any impact
on the state of the request!
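
As a rough sketch of that lifetime (simplified stand-in types and helpers,
not the real ompi_request_t code), the difference is only where the free
call sits:

typedef struct request request_t;

extern request_t *pml_send_start(void);            /* allocates the request */
extern void       progress_until_complete(request_t *req);
extern void       request_free(request_t **req);   /* stand-in for ompi_request_free */

int sketch_MPI_Send(void)
{
    request_t *req = pml_send_start();
    progress_until_complete(req);
    request_free(&req);           /* blocking path: the free happens right here */
    return 0;
}

int sketch_MPI_Isend(request_t **user_req)
{
    *user_req = pml_send_start(); /* non-blocking path: the free happens later, */
    return 0;                     /* from MPI_Wait / MPI_Test via the same call */
}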


Is there a simple test case I can run in order to trigger this strange  
behavior ?


  Thanks,
george.







george.


This is *only* a problem for requests that are involved from the
current top-level MPI call.  Requests from prior calls to MPI
functions
(e.g., a request from a prior call to MPI_ISEND) are ok because a)
we've already done the Right Things to ensure the safety of that
request, and b) that request is not on the recursive stack anywhere
to
become stale as the recursion unwinds.

Right?

If so, Galen proposes the following:

1. in conjunction with the NOT_ON_WIRE proposal...

2. make a new PML request flag DONT_FREE_ME (or some better
name :-) ).

3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more
specifically, the top of the PML calls for blocking send/receive)
right when the request is allocated (i.e., before calling
btl_send()).

4. when the PML is called for completion on this request, it will do
all the stuff that it needs to effect completion -- but then it will
see the DONT_FREE_ME flag and not actually free the request.
Obviously, if DONT_FREE_ME is *not* set, then the PML does what it
does today: it frees the request.

5. the top-level PML call will eventually complete:
5a. For blocking PML calls (e.g., corresponding to MPI_SEND and
MPI_RECV), the request can be unconditionally freed.
5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND),
only free the request if it was completed.

Note that with this scheme, it becomes irrelevant as to whether the
PML completion call is invoked on the first descent into the BTL or
recursively via opal_progress.

How does that sound?

If that all works, it might be beneficial to put this back to the  
1.2

branch because there are definitely apps that would benefit from it.



On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote:


So this problem goes WAY back..

The problem here is that the PML marks MPI completion just prior  
to

calling
btl_send and then returns to the user. This wouldn't be a problem
if the BTL
then did something, but in the case of OpenIB this fragment may  
not

actually
be on the wire (the joys of user level flow control).

One solution that we proposed was to allow btl_send to return
either
OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the
PML to
not mark MPI completion of the fragment and then MPI_WAITALL and
others will
do their job properly.

I even implemented this once, but there is a problem. Currently we
mark
request as completed on MPI level and then do btl_send(). Whenever
IB completion
will happen the request will be marked as complete on PML level and
freed. The fix requires to change the order like this: Call
btl_send(),
check return value from BTL and mark request complete as necessary.
The
problem is that because we allow BTL to call opal_progress()
internally the
request may already be completed on the MPI and PML levels and freed
before return from
the call to btl_send().

I did a code review to see how hard it will be to get rid of
recursion
in Open MPI and I think this is doable. We have to disallow calling
progress() (or other functions that may call progress() internally)
from
BTL and from ULP callbacks that are called by BTL. There are not many
places that break this law. The main offenders are calls to
FREE_LIST_WAIT(), but those never actually call progress if they  
can

grow without limit and this is the most common use of
FREE_LIST_WAIT()
so they may be safely changed to FREE_LIST_GET(). After we will
solve

Re: [OMPI devel] collective problems

2007-11-07 Thread Jeff Squyres

On Nov 7, 2007, at 12:29 PM, George Bosilca wrote:


I finally talked with Galen and Don about this issue in depth.  Our
understanding is that the "request may get freed before recursion
unwinds" issue is *only* a problem within the context of a single MPI
call (e.g., MPI_SEND).  Is that right?


I wonder how this happens ?

Specifically, if in an MPI_SEND, the BTL ends up buffering the  
message

and setting early completion, but then recurses into opal_progress()
and ends up sending the message and freeing the request during the
recursion, then when the recursion unwinds, the original caller will
have a stale request.


The same callback is called in both cases. In the case that you  
described, the callback is called just a little bit deeper into the  
recursion, when in the "normal case" it will get called from the  
first level of the recursion. Or maybe I miss something here ...


Right -- it's not the callback that is the problem.  It's when the  
recursion is unwound and further up the stack you now have a stale  
request.




 george.


This is *only* a problem for requests that are involved from the
current top-level MPI call.  Requests from prior calls to MPI
functions

(e.g., a request from a prior call to MPI_ISEND) are ok because a)
we've already done the Right Things to ensure the safety of that
request, and b) that request is not on the recursive stack anywhere  
to

become stale as the recursion unwinds.

Right?

If so, Galen proposes the following:

1. in conjunction with the NOT_ON_WIRE proposal...

2. make a new PML request flag DONT_FREE_ME (or some better  
name :-) ).


3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more
specifically, the top of the PML calls for blocking send/receive)
right when the request is allocated (i.e., before calling  
btl_send()).


4. when the PML is called for completion on this request, it will do
all the stuff that it needs to effect completion -- but then it will
see the DONT_FREE_ME flag and not actually free the request.
Obviously, if DONT_FREE_ME is *not* set, then the PML does what it
does today: it frees the request.

5. the top-level PML call will eventually complete:
5a. For blocking PML calls (e.g., corresponding to MPI_SEND and
MPI_RECV), the request can be unconditionally freed.
5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND),
only free the request if it was completed.

Note that with this scheme, it becomes irrelevant as to whether the
PML completion call is invoked on the first descent into the BTL or
recursively via opal_progress.

How does that sound?

If that all works, it might be beneficial to put this back to the 1.2
branch because there are definitely apps that would benefit from it.



On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote:


So this problem goes WAY back..

The problem here is that the PML marks MPI completion just prior to
calling
btl_send and then returns to the user. This wouldn't be a problem
if the BTL
then did something, but in the case of OpenIB this fragment may not
actually
be on the wire (the joys of user level flow control).

One solution that we proposed was to allow btl_send to return  
either

OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the
PML to
not mark MPI completion of the fragment and then MPI_WAITALL and
others will
do their job properly.

I even implemented this once, but there is a problem. Currently we
mark
request as completed on MPI level and then do btl_send(). Whenever
IB completion
will happen the request will be marked as complete on PML level and
freed. The fix requires to change the order like this: Call
btl_send(),
check return value from BTL and mark request complete as necessary.
The
problem is that because we allow BTL to call opal_progress()
internally the
request may already be completed on the MPI and PML levels and freed
before return from
the call to btl_send().

I did a code review to see how hard it will be to get rid of  
recursion

in Open MPI and I think this is doable. We have to disallow calling
progress() (or other functions that may call progress() internally)
from
BTL and from ULP callbacks that are called by BTL. There are not many
places that break this law. The main offenders are calls to
FREE_LIST_WAIT(), but those never actually call progress if they can
grow without limit and this is the most common use of  
FREE_LIST_WAIT()
so they may be safely changed to FREE_LIST_GET(). Once we solve the
recursion problem, the fix will be a couple of lines of
code.



- Galen



On 10/11/07 11:26 AM, "Gleb Natapov"  wrote:


On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:

David --

Gleb and I just actively re-looked at this problem yesterday; we
think it's related to https://svn.open-mpi.org/trac/ompi/ticket/
1015.  We previously thought this ticket was a different problem,
but
our analysis yesterday shows that it could be a real problem in  
the

openib BTL or ob1 PML (kinda 

Re: [OMPI devel] collective problems

2007-11-07 Thread George Bosilca


On Nov 7, 2007, at 11:06 AM, Jeff Squyres wrote:


Gleb --

I finally talked with Galen and Don about this issue in depth.  Our
understanding is that the "request may get freed before recursion
unwinds" issue is *only* a problem within the context of a single MPI
call (e.g., MPI_SEND).  Is that right?


I wonder how this happens ?


Specifically, if in an MPI_SEND, the BTL ends up buffering the message
and setting early completion, but then recurses into opal_progress()
and ends up sending the message and freeing the request during the
recursion, then when the recursion unwinds, the original caller will
have a stale request.


The same callback is called in both cases. In the case that you  
described, the callback is called just a little bit deeper into the  
recursion, when in the "normal case" it will get called from the first  
level of the recursion. Or maybe I miss something here ...


  george.


This is *only* a problem for requests that are involved from the
current top-level MPI call.  Requests from prior calls to MPI functions
(e.g., a request from a prior call to MPI_ISEND) are ok because a)
we've already done the Right Things to ensure the safety of that
request, and b) that request is not on the recursive stack anywhere to
become stale as the recursion unwinds.

Right?

If so, Galen proposes the following:

1. in conjunction with the NOT_ON_WIRE proposal...

2. make a new PML request flag DONT_FREE_ME (or some better  
name :-) ).


3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more
specifically, the top of the PML calls for blocking send/receive)
right when the request is allocated (i.e., before calling btl_send()).

4. when the PML is called for completion on this request, it will do
all the stuff that it needs to effect completion -- but then it will
see the DONT_FREE_ME flag and not actually free the request.
Obviously, if DONT_FREE_ME is *not* set, then the PML does what it
does today: it frees the request.

5. the top-level PML call will eventually complete:
5a. For blocking PML calls (e.g., corresponding to MPI_SEND and
MPI_RECV), the request can be unconditionally freed.
5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND),
only free the request if it was completed.

Note that with this scheme, it becomes irrelevant as to whether the
PML completion call is invoked on the first descent into the BTL or
recursively via opal_progress.

How does that sound?

If that all works, it might be beneficial to put this back to the 1.2
branch because there are definitely apps that would benefit from it.



On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote:


So this problem goes WAY back..

The problem here is that the PML marks MPI completion just prior to
calling
btl_send and then returns to the user. This wouldn't be a problem
if the BTL
then did something, but in the case of OpenIB this fragment may not
actually
be on the wire (the joys of user level flow control).

One solution that we proposed was to allow btl_send to return either
OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the
PML to
not mark MPI completion of the fragment and then MPI_WAITALL and
others will
do their job properly.

I even implemented this once, but there is a problem. Currently we
mark
request as completed on MPI level and then do btl_send(). Whenever
IB completion
will happen the request will be marked as complete on PML level and
freed. The fix requires to change the order like this: Call
btl_send(),
check return value from BTL and mark request complete as necessary.
The
problem is that because we allow BTL to call opal_progress()
internally the
request may already be completed on the MPI and PML levels and freed
before return from
the call to btl_send().

I did a code review to see how hard it will be to get rid of  
recursion

in Open MPI and I think this is doable. We have to disallow calling
progress() (or other functions that may call progress() internally)
from
BTL and from ULP callbacks that are called by BTL. There are not many
places that break this law. The main offenders are calls to
FREE_LIST_WAIT(), but those never actually call progress if they can
grow without limit and this is the most common use of  
FREE_LIST_WAIT()

so they may be safely changed to FREE_LIST_GET(). Once we solve the
recursion problem, the fix will be a couple of lines of
code.



- Galen



On 10/11/07 11:26 AM, "Gleb Natapov"  wrote:


On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:

David --

Gleb and I just actively re-looked at this problem yesterday; we
think it's related to https://svn.open-mpi.org/trac/ompi/ticket/
1015.  We previously thought this ticket was a different problem,
but
our analysis yesterday shows that it could be a real problem in  
the

openib BTL or ob1 PML (kinda think it's the openib btl because it
doesn't seem to happen on other networks, but who knows...).

Gleb is investigating.

Here is the result of the 

[OMPI devel] Incorrect one-sided test

2007-11-07 Thread Brian W. Barrett

Hi all -

Lisa Glendenning, who's working on a Portals one-sided component, 
discovered that the test onesided/test_start1.c in our repository is 
incorrect.  It assumes that MPI_Win_start is non-blocking, but the 
standard says that "MPI_WIN_START is allowed to block until the 
corresponding MPI_WIN_POST calls are executed".  The pt2pt and rdma 
components did not block, so the test error did not show up with those 
components.
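
For reference, a conforming use of general active-target synchronization
looks roughly like this (a minimal sketch; 'partner' is assumed to be the
group containing the other rank):

#include <mpi.h>

/* rank 0 is the target, rank 1 the origin */
void exchange(MPI_Win win, MPI_Group partner, int rank, int value)
{
    if (0 == rank) {
        MPI_Win_post(partner, 0, win);    /* expose the window */
        MPI_Win_wait(win);                /* returns after the origin calls complete */
    } else if (1 == rank) {
        MPI_Win_start(partner, 0, win);   /* may legally block until the post above */
        MPI_Put(&value, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
        MPI_Win_complete(win);
    }
}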


I've fixed the test in r1223, but thought I'd let everyone know I changed 
one of our conformance tests.


Brian


Re: [OMPI devel] collective problems

2007-11-07 Thread Jeff Squyres

Gleb --

I finally talked with Galen and Don about this issue in depth.  Our  
understanding is that the "request may get freed before recursion  
unwinds" issue is *only* a problem within the context of a single MPI  
call (e.g., MPI_SEND).  Is that right?


Specifically, if in an MPI_SEND, the BTL ends up buffering the message  
and setting early completion, but then recurses into opal_progress()  
and ends up sending the message and freeing the request during the  
recursion, then when the recursion unwinds, the original caller will  
have a stale request.
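
A schematic reproduction of that chain in plain C (purely illustrative
stand-ins, not ob1 code) shows where the stale pointer comes from:

#include <stdlib.h>

struct req { int complete; };

static struct req *pending;          /* request visible to the progress engine */

static void progress(void)           /* stand-in for opal_progress() */
{
    if (pending != NULL) {
        pending->complete = 1;       /* completion callback fires ... */
        free(pending);               /* ... and frees the request */
        pending = NULL;
    }
}

static void btl_send(struct req *r)
{
    pending = r;                     /* "buffered" for lack of credits */
    progress();                      /* BTL recurses into the progress engine */
}

int main(void)
{
    struct req *r = malloc(sizeof *r);
    if (NULL == r) {
        return 1;
    }
    r->complete = 0;
    btl_send(r);
    /* back at the "PML" level, r is stale: touching r->complete here would be a
     * use-after-free -- exactly the hazard described above */
    return 0;
}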


This is *only* a problem for requests that are involved from the  
current top-level MPI call.  Requests from prior calls to MPI functions
(e.g., a request from a prior call to MPI_ISEND) are ok because a)  
we've already done the Right Things to ensure the safety of that  
request, and b) that request is not on the recursive stack anywhere to  
become stale as the recursion unwinds.


Right?

If so, Galen proposes the following:

1. in conjunction with the NOT_ON_WIRE proposal...

2. make a new PML request flag DONT_FREE_ME (or some better name :-) ).

3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more  
specifically, the top of the PML calls for blocking send/receive)  
right when the request is allocated (i.e., before calling btl_send()).


4. when the PML is called for completion on this request, it will do  
all the stuff that it needs to effect completion -- but then it will  
see the DONT_FREE_ME flag and not actually free the request.   
Obviously, if DONT_FREE_ME is *not* set, then the PML does what it  
does today: it frees the request.


5. the top-level PML call will eventually complete:
5a. For blocking PML calls (e.g., corresponding to MPI_SEND and  
MPI_RECV), the request can be unconditionally freed.
5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND),  
only free the request if it was completed.


Note that with this scheme, it becomes irrelevant as to whether the  
PML completion call is invoked on the first descent into the BTL or  
recursively via opal_progress.
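
A minimal sketch of steps 2-5, assuming the NOT_ON_WIRE piece is in place;
the flag name and helpers below are hypothetical stand-ins, not the actual
ob1 request code:

#include <stdbool.h>

#define REQ_FLAG_DONT_FREE_ME  0x1          /* hypothetical new request flag */

struct pml_request {
    unsigned      flags;
    volatile bool complete;
};

static void request_free(struct pml_request *req)
{
    (void)req;                              /* return the request to its free list */
}

/* completion path -- may run deep inside opal_progress() */
void pml_complete(struct pml_request *req)
{
    req->complete = true;
    if (!(req->flags & REQ_FLAG_DONT_FREE_ME)) {
        request_free(req);                  /* step 4: today's behaviour */
    }
    /* otherwise leave the request alive for the top-level caller */
}

/* top of the blocking send path (step 3) */
int pml_send_blocking(struct pml_request *req)
{
    req->flags |= REQ_FLAG_DONT_FREE_ME;    /* set before calling btl_send() */
    /* ... btl_send(), then progress until req->complete ... */
    request_free(req);                      /* step 5a: unconditional free here */
    return 0;
}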


How does that sound?

If that all works, it might be beneficial to put this back to the 1.2  
branch because there are definitely apps that would benefit from it.




On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote:


So this problem goes WAY back..

The problem here is that the PML marks MPI completion just prior to  
calling
btl_send and then returns to the user. This wouldn't be a problem  
if the BTL
then did something, but in the case of OpenIB this fragment may not  
actually

be on the wire (the joys of user level flow control).

One solution that we proposed was to allow btl_send to return either
OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the  
PML to
not mark MPI completion of the fragment and then MPI_WAITALL and  
others will

 do their job properly.
I even implemented this once, but there is a problem. Currently we  
mark
request as completed on MPI level and then do btl_send(). Whenever  
IB completion

will happen the request will be marked as complete on PML level and
freed. The fix requires to change the order like this: Call  
btl_send(),
check return value from BTL and mark request complete as necessary.  
The
problem is that because we allow BTL to call opal_progress()  
internally the
request may already be completed on the MPI and PML levels and freed
before return from

the call to btl_send().

I did a code review to see how hard it will be to get rid of recursion
in Open MPI and I think this is doable. We have to disallow calling
progress() (or other functions that may call progress() internally)  
from

BTL and from ULP callbacks that are called by BTL. There are not many
places that break this law. The main offenders are calls to
FREE_LIST_WAIT(), but those never actually call progress if they can
grow without limit and this is the most common use of FREE_LIST_WAIT()
so they may be safely changed to FREE_LIST_GET(). Once we solve the
recursion problem, the fix will be a couple of lines of
code.
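
A self-contained sketch of the difference Gleb is pointing at (the real
macros are OMPI_FREE_LIST_WAIT / OMPI_FREE_LIST_GET; the simplified versions
below only mirror the semantics described here, where WAIT may drive the
progress engine and GET never does):

#include <stdlib.h>

struct item      { struct item *next; };
struct free_list { struct item *head; int can_grow; };

static void progress(void) { /* stand-in for opal_progress() */ }

/* GET: never progresses; may return NULL if the list is capped and empty */
struct item *list_get(struct free_list *fl)
{
    if (fl->head != NULL) {
        struct item *i = fl->head;
        fl->head = i->next;
        return i;
    }
    return fl->can_grow ? calloc(1, sizeof(struct item)) : NULL;
}

/* WAIT: may recurse into the progress engine -- the behaviour that has to be
 * avoided in code that is itself called from a BTL */
struct item *list_wait(struct free_list *fl)
{
    struct item *i;
    while (NULL == (i = list_get(fl))) {
        progress();
    }
    return i;
}

Since most of these lists can grow without limit, the WAIT variant never
actually blocks there, so replacing it with GET plus a NULL check keeps the
behaviour while removing one way for BTL-invoked code to re-enter the
progress engine.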



- Galen



On 10/11/07 11:26 AM, "Gleb Natapov"  wrote:


On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:

David --

Gleb and I just actively re-looked at this problem yesterday; we
think it's related to https://svn.open-mpi.org/trac/ompi/ticket/
1015.  We previously thought this ticket was a different problem,  
but

our analysis yesterday shows that it could be a real problem in the
openib BTL or ob1 PML (kinda think it's the openib btl because it
doesn't seem to happen on other networks, but who knows...).

Gleb is investigating.
Here is the result of the investigation. The problem is different from
ticket #1015. What we have here is that one rank calls isend() of a small
message and wait_all() in a loop, and another one calls irecv(). The
problem is that isend() usually doesn't call opal_progress()  
anywhere
and wait_all() doesn't call progress if all 

Re: [OMPI devel] Multiworld MCA parameter values broken

2007-11-07 Thread Ralph H Castain
Sorry for delay - wasn't ignoring the issue.

There are several fixes to this problem - ranging in order from least to
most work:

1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. It
won't affect anything on the backend because the daemon/procs don't use ssh.

2. include "pls_rsh_agent" in the array of mca params not to be passed to
the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the
orte_pls_base_orted_append_basic_args function. This would fix the specific
problem cited here, but I admit that listing every such param by name would
get tedious.

3. we could easily detect that a "problem" character was in the mca param
value when we add it to the orted's argv, and then put "" around it. The
problem, however, is that the mca param parser on the far end doesn't remove
those "" from the resulting string. At least, I spent over a day fighting
with a problem only to discover that was happening. Could be an error in the
way I was doing things, or could be a real characteristic of the parser.
Anyway, we would have to ensure that the parser removes any surrounding ""
before passing along the param value or this won't work.
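
A minimal sketch of the mpirun-side half of option 3; quote_if_needed() is a
hypothetical helper (not an existing OPAL/ORTE function), and the matching
parser-side change -- stripping the quotes again on the orted -- would still
be needed, as noted above:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical helper: returns a newly allocated copy of the value, quoted if
 * it contains whitespace, so it survives the trip through the remote shell */
static char *quote_if_needed(const char *value)
{
    char  *out;
    size_t len;

    if (strpbrk(value, " \t") == NULL) {
        return strdup(value);                 /* nothing to protect */
    }
    len = strlen(value) + 3;                  /* two quotes plus the NUL */
    out = malloc(len);
    if (out != NULL) {
        snprintf(out, len, "\"%s\"", value);
    }
    return out;
}

int main(void)
{
    char *agent = quote_if_needed("ssh -Y");  /* the failing case above */
    if (NULL == agent) {
        return 1;
    }
    printf("-mca pls_rsh_agent %s\n", agent); /* -> -mca pls_rsh_agent "ssh -Y" */
    free(agent);
    return 0;
}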

Ralph



On 11/5/07 12:10 PM, "Tim Prins"  wrote:

> Hi,
> 
> Commit 16364 broke things when using multiword mca param values. For
> instance:
> 
> mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent
> "ssh -Y" xterm
> 
> Will crash and burn, because the value "ssh -Y" is being stored into the
> argv orted_cmd_line in orterun.c:1506. This is then added to the launch
> command for the orted:
> 
> /usr/bin/ssh -Y odin004  PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ;
> export PATH ; 
> LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ;
> export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug
> --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename
> odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872
> --nsreplica 
> "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0:4090
> 8" 
> --gprreplica 
> "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0:4090
> 8" 
> -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca
> mca_base_param_file_path
> /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/examp
> les 
> -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples
> 
> Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So
> the quotes have been lost, as we die a horrible death.
> 
> So we need to add the quotes back in somehow, or pass these options
> differently. I'm not sure what the best way to fix this.
> 
> Thanks,
> 
> Tim