Re: [OMPI devel] collective problems
On 11/8/07 12:25 AM, "Patrick Geoffray" wrote: > Richard Graham wrote: >> The real problem, as you and others have pointed out is the lack of >> predictable time slices for the progress engine to do its work, when relying >> on the ULP to make calls into the library... > > The real, real problem is that the BTL should handle progression at > their level, specially when the buffering is due to BTL-level flow > control. When I write something into a socket, TCP will take care of > sending it eventually, for example. Agreed - but if it relies on the ULP to get into the progress engine, you still have the problem of a lack of predictable time slices. > > Rich, your clock is one hour late (we change to standard time a couple > of days ago...) Thanks. This is an Entourage problem I have not yet managed to figure out how to fix ... Rich > > Patrick > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] collective problems
The lengths we go to avoid progress :-) On 11/7/07 10:19 PM, "Richard Graham" wrote: > The real problem, as you and others have pointed out is the lack of > predictable time slices for the progress engine to do its work, when relying > on the ULP to make calls into the library... > > Rich > > > On 11/8/07 12:07 AM, "Brian Barrett" wrote: > >> As it stands today, the problem is that we can inject things into the >> BTL successfully that are not injected into the NIC (due to software >> flow control). Once a message is injected into the BTL, the PML marks >> completion on the MPI request. If it was a blocking send that got >> marked as complete, but the message isn't injected into the NIC/NIC >> library, and the user doesn't re-enter the MPI library for a >> considerable amount of time, then we have a problem. >> >> Personally, I'd rather just not mark MPI completion until a local >> completion callback from the BTL. But others don't like that idea, so >> we came up with a way for back pressure from the BTL to say "it's not >> on the wire yet". This is more complicated than just not marking MPI >> completion early, but why would we do something that helps real apps >> at the expense of benchmarks? That would just be silly! >> >> Brian >> >> On Nov 7, 2007, at 7:56 PM, Richard Graham wrote: >> >>> Does this mean that we don¹t have a queue to store btl level >>> descriptors that >>> are only partially complete ? Do we do an all or nothing with >>> respect to btl >>> level requests at this stage ? >>> >>> Seems to me like we want to mark things complete at the MPI level >>> ASAP, and >>> that this proposal is not to do that is this correct ? >>> >>> Rich >>> >>> >>> On 11/7/07 11:26 PM, "Jeff Squyres" wrote: >>> On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: >> Remember that this is all in the context of Galen's proposal for >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the send >> was successful, but it has not yet been sent (e.g., openib BTL >> buffered it because it ran out of credits). > > Sorry if I miss something obvious, but why does the PML has to be > aware > of the flow control situation of the BTL ? If the BTL cannot send > something right away for any reason, it should be the responsibility > of > the BTL to buffer it and to progress on it later. That's currently the way it is. But the BTL currently only has the option to say two things: 1. "ok, done!" -- then the PML will think that the request is complete 2. "doh -- error!" -- then the PML thinks that Something Bad Happened(tm) What we really need is for the BTL to have a third option: 3. "not done yet!" So that the PML knows that the request is not yet done, but will allow other things to progress while we're waiting for it to complete. Without this, the openib BTL currently replies "ok, done!", even when it has only buffered a message (rather than actually sending it out). This optimization works great (yeah, I know...) except for apps that don't dip into the MPI library frequently. :-\ -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] collective problems
Richard Graham wrote: The real problem, as you and others have pointed out is the lack of predictable time slices for the progress engine to do its work, when relying on the ULP to make calls into the library... The real, real problem is that the BTL should handle progression at its own level, especially when the buffering is due to BTL-level flow control. When I write something into a socket, TCP will take care of sending it eventually, for example. Rich, your clock is one hour late (we changed to standard time a couple of days ago...) Patrick
Re: [OMPI devel] collective problems
The real problem, as you and others have pointed out is the lack of predictable time slices for the progress engine to do its work, when relying on the ULP to make calls into the library... Rich On 11/8/07 12:07 AM, "Brian Barrett" wrote: > As it stands today, the problem is that we can inject things into the > BTL successfully that are not injected into the NIC (due to software > flow control). Once a message is injected into the BTL, the PML marks > completion on the MPI request. If it was a blocking send that got > marked as complete, but the message isn't injected into the NIC/NIC > library, and the user doesn't re-enter the MPI library for a > considerable amount of time, then we have a problem. > > Personally, I'd rather just not mark MPI completion until a local > completion callback from the BTL. But others don't like that idea, so > we came up with a way for back pressure from the BTL to say "it's not > on the wire yet". This is more complicated than just not marking MPI > completion early, but why would we do something that helps real apps > at the expense of benchmarks? That would just be silly! > > Brian > > On Nov 7, 2007, at 7:56 PM, Richard Graham wrote: > >> Does this mean that we don¹t have a queue to store btl level >> descriptors that >> are only partially complete ? Do we do an all or nothing with >> respect to btl >> level requests at this stage ? >> >> Seems to me like we want to mark things complete at the MPI level >> ASAP, and >> that this proposal is not to do that is this correct ? >> >> Rich >> >> >> On 11/7/07 11:26 PM, "Jeff Squyres" wrote: >> >>> On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: >>> > Remember that this is all in the context of Galen's proposal for > btl_send() to be able to return NOT_ON_WIRE -- meaning that the >>> send > was successful, but it has not yet been sent (e.g., openib BTL > buffered it because it ran out of credits). Sorry if I miss something obvious, but why does the PML has to be aware of the flow control situation of the BTL ? If the BTL cannot send something right away for any reason, it should be the >>> responsibility of the BTL to buffer it and to progress on it later. >>> >>> >>> That's currently the way it is. But the BTL currently only has the >>> option to say two things: >>> >>> 1. "ok, done!" -- then the PML will think that the request is >>> complete >>> 2. "doh -- error!" -- then the PML thinks that Something Bad >>> Happened(tm) >>> >>> What we really need is for the BTL to have a third option: >>> >>> 3. "not done yet!" >>> >>> So that the PML knows that the request is not yet done, but will >>> allow >>> other things to progress while we're waiting for it to complete. >>> Without this, the openib BTL currently replies "ok, done!", even when >>> it has only buffered a message (rather than actually sending it out). >>> This optimization works great (yeah, I know...) except for apps that >>> don't dip into the MPI library frequently. :-\ >>> >>> -- >>> Jeff Squyres >>> Cisco Systems >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] collective problems
As it stands today, the problem is that we can inject things into the BTL successfully that are not injected into the NIC (due to software flow control). Once a message is injected into the BTL, the PML marks completion on the MPI request. If it was a blocking send that got marked as complete, but the message isn't injected into the NIC/NIC library, and the user doesn't re-enter the MPI library for a considerable amount of time, then we have a problem. Personally, I'd rather just not mark MPI completion until a local completion callback from the BTL. But others don't like that idea, so we came up with a way for back pressure from the BTL to say "it's not on the wire yet". This is more complicated than just not marking MPI completion early, but why would we do something that helps real apps at the expense of benchmarks? That would just be silly! Brian On Nov 7, 2007, at 7:56 PM, Richard Graham wrote: Does this mean that we don’t have a queue to store btl level descriptors that are only partially complete ? Do we do an all or nothing with respect to btl level requests at this stage ? Seems to me like we want to mark things complete at the MPI level ASAP, and that this proposal is not to do that – is this correct ? Rich On 11/7/07 11:26 PM, "Jeff Squyres" wrote: On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: >> Remember that this is all in the context of Galen's proposal for >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the send >> was successful, but it has not yet been sent (e.g., openib BTL >> buffered it because it ran out of credits). > > Sorry if I miss something obvious, but why does the PML has to be > aware > of the flow control situation of the BTL ? If the BTL cannot send > something right away for any reason, it should be the responsibility > of > the BTL to buffer it and to progress on it later. That's currently the way it is. But the BTL currently only has the option to say two things: 1. "ok, done!" -- then the PML will think that the request is complete 2. "doh -- error!" -- then the PML thinks that Something Bad Happened(tm) What we really need is for the BTL to have a third option: 3. "not done yet!" So that the PML knows that the request is not yet done, but will allow other things to progress while we're waiting for it to complete. Without this, the openib BTL currently replies "ok, done!", even when it has only buffered a message (rather than actually sending it out). This optimization works great (yeah, I know...) except for apps that don't dip into the MPI library frequently. :-\ -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] collective problems
Does this mean that we don't have a queue to store btl level descriptors that are only partially complete ? Do we do an all or nothing with respect to btl level requests at this stage ? Seems to me like we want to mark things complete at the MPI level ASAP, and that this proposal is not to do that -- is this correct ? Rich On 11/7/07 11:26 PM, "Jeff Squyres" wrote: > On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: > >>> >> Remember that this is all in the context of Galen's proposal for >>> >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the send >>> >> was successful, but it has not yet been sent (e.g., openib BTL >>> >> buffered it because it ran out of credits). >> > >> > Sorry if I miss something obvious, but why does the PML has to be >> > aware >> > of the flow control situation of the BTL ? If the BTL cannot send >> > something right away for any reason, it should be the responsibility >> > of >> > the BTL to buffer it and to progress on it later. > > > That's currently the way it is. But the BTL currently only has the > option to say two things: > > 1. "ok, done!" -- then the PML will think that the request is complete > 2. "doh -- error!" -- then the PML thinks that Something Bad > Happened(tm) > > What we really need is for the BTL to have a third option: > > 3. "not done yet!" > > So that the PML knows that the request is not yet done, but will allow > other things to progress while we're waiting for it to complete. > Without this, the openib BTL currently replies "ok, done!", even when > it has only buffered a message (rather than actually sending it out). > This optimization works great (yeah, I know...) except for apps that > don't dip into the MPI library frequently. :-\ > > -- > Jeff Squyres > Cisco Systems
Re: [OMPI devel] collective problems
On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ran out of credits). Sorry if I miss something obvious, but why does the PML has to be aware of the flow control situation of the BTL ? If the BTL cannot send something right away for any reason, it should be the responsibility of the BTL to buffer it and to progress on it later. That's currently the way it is. But the BTL currently only has the option to say two things: 1. "ok, done!" -- then the PML will think that the request is complete 2. "doh -- error!" -- then the PML thinks that Something Bad Happened(tm) What we really need is for the BTL to have a third option: 3. "not done yet!" So that the PML knows that the request is not yet done, but will allow other things to progress while we're waiting for it to complete. Without this, the openib BTL currently replies "ok, done!", even when it has only buffered a message (rather than actually sending it out). This optimization works great (yeah, I know...) except for apps that don't dip into the MPI library frequently. :-\ -- Jeff Squyres Cisco Systems
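For illustration, a minimal sketch of what that third option could look like from the PML's side. Every name below (btl_send_status, BTL_SEND_NOT_ON_WIRE, pml_send_fragment, mark_mpi_completion) is a hypothetical placeholder rather than an actual Open MPI symbol; only the control flow matters.

    /* Hypothetical sketch, not the real BTL/PML interface: a tri-state
     * result from btl_send() and how a PML send path could react to it. */

    enum btl_send_status {
        BTL_SEND_DONE,        /* 1. "ok, done!"     -- safe to mark MPI completion   */
        BTL_SEND_ERROR,       /* 2. "doh -- error!" -- Something Bad Happened(tm)    */
        BTL_SEND_NOT_ON_WIRE  /* 3. "not done yet!" -- buffered, e.g. out of credits */
    };

    /* Assumed helpers -- placeholders for whatever the PML really uses. */
    extern enum btl_send_status btl_send(void *btl, void *descriptor);
    extern void mark_mpi_completion(void *request);

    int pml_send_fragment(void *btl, void *descriptor, void *request)
    {
        switch (btl_send(btl, descriptor)) {
        case BTL_SEND_DONE:
            /* The fragment really left; marking MPI completion now is safe. */
            mark_mpi_completion(request);
            return 0;
        case BTL_SEND_NOT_ON_WIRE:
            /* Only buffered: leave the request pending and let the BTL's
             * local-completion callback mark it complete later. */
            return 0;
        default:
            return -1;   /* propagate the error */
        }
    }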
Re: [OMPI devel] collective problems
Jeff Squyres wrote: This is not a problem in the current code base. Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ran out of credits). Sorry if I'm missing something obvious, but why does the PML have to be aware of the flow control situation of the BTL? If the BTL cannot send something right away for any reason, it should be the responsibility of the BTL to buffer it and to progress on it later. Patrick
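Patrick's alternative, keeping the flow-control state entirely inside the BTL, could look roughly like the sketch below. The types and helper functions are invented for the example and are not the openib BTL's real pending-fragment code.

    /* Rough sketch (invented names): the BTL keeps its own FIFO of fragments
     * it could not put on the wire and drains it from its progress function
     * once send credits return, so the PML never needs to know about the
     * flow-control state. */

    #include <stddef.h>

    struct pending_frag {
        struct pending_frag *next;
        void *descriptor;
    };

    struct btl_endpoint {
        int send_credits;                  /* flow-control credits left     */
        struct pending_frag *pending_head; /* fragments waiting for credits */
        struct pending_frag **pending_tail;
    };

    extern void post_to_nic(struct btl_endpoint *ep, void *descriptor);

    /* Called by btl_send() when there are no credits: remember the fragment. */
    void enqueue_pending(struct btl_endpoint *ep, struct pending_frag *frag)
    {
        frag->next = NULL;
        *ep->pending_tail = frag;
        ep->pending_tail = &frag->next;
    }

    /* Called from the BTL progress function whenever credits come back. */
    void drain_pending(struct btl_endpoint *ep)
    {
        while (ep->pending_head != NULL && ep->send_credits > 0) {
            struct pending_frag *frag = ep->pending_head;
            ep->pending_head = frag->next;
            if (ep->pending_head == NULL)
                ep->pending_tail = &ep->pending_head;
            ep->send_credits--;
            post_to_nic(ep, frag->descriptor);  /* completion callback fires later */
        }
    }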
Re: [OMPI devel] Multiworld MCA parameter values broken
What changed is that we never passed mca params to the orted before - they always went to the app, but it's the orted that has the issue. There is a bug ticket thread on this subject - I forget the number immediately. Basically, the problem was that we cannot generally pass the local environment to the orteds when we launch them. However, people needed various mca params to get to the orteds to control their behavior. The only way to resolve that problem was to pass the params via the command line, which is what was done. Except for a very few cases, all of our mca params are single values that do not include spaces, so this is not a problem that is causing widespread issues. As I said, I already had to deal with one special case that didn't involve spaces, but did have special characters that required quoting, which identified the larger problem of dealing with quoted strings. I have no objection to a more general fix. Like I said in my note, though, the general fix will take a larger effort. If someone is willing to do so, that is fine with me - I was only offering solutions that would fill the interim time as I haven't heard anyone step up to say they would fix it anytime soon. Please feel free to jump in and volunteer! ;-) I'm willing to put the quotes around things if you will fix the mca cmd line parser to cleanly remove them on the other end. Ralph On 11/7/07 5:50 PM, "Tim Prins" wrote: > I'm curious what changed to make this a problem. How were we passing mca param > from the base to the app before, and why did it change? > > I think that options 1 & 2 below are no good, since we, in general, allow > string mca params to have spaces (as far as I understand it). So a more > general approach is needed. > > Tim > > On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote: >> Sorry for delay - wasn't ignoring the issue. >> >> There are several fixes to this problem - ranging in order from least to >> most work: >> >> 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. >> It won't affect anything on the backend because the daemon/procs don't use >> ssh. >> >> 2. include "pls_rsh_agent" in the array of mca params not to be passed to >> the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the >> orte_pls_base_orted_append_basic_args function. This would fix the specific >> problem cited here, but I admit that listing every such param by name would >> get tedious. >> >> 3. we could easily detect that a "problem" character was in the mca param >> value when we add it to the orted's argv, and then put "" around it. The >> problem, however, is that the mca param parser on the far end doesn't >> remove those "" from the resulting string. At least, I spent over a day >> fighting with a problem only to discover that was happening. Could be an >> error in the way I was doing things, or could be a real characteristic of >> the parser. Anyway, we would have to ensure that the parser removes any >> surrounding "" before passing along the param value or this won't work. >> >> Ralph >> >> On 11/5/07 12:10 PM, "Tim Prins" wrote: >>> Hi, >>> >>> Commit 16364 broke things when using multiword mca param values. For >>> instance: >>> >>> mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent >>> "ssh -Y" xterm >>> >>> Will crash and burn, because the value "ssh -Y" is being stored into the >>> argv orted_cmd_line in orterun.c:1506. 
This is then added to the launch >>> command for the orted: >>> >>> /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ; >>> export PATH ; >>> LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ; >>> export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug >>> --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename >>> odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872 >>> --nsreplica >>> "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 >>> :4090 8" >>> --gprreplica >>> "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 >>> :4090 8" >>> -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca >>> mca_base_param_file_path >>> /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/ >>> examp les >>> -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples >>> >>> Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So >>> the quotes have been lost, as we die a horrible death. >>> >>> So we need to add the quotes back in somehow, or pass these options >>> differently. I'm not sure what the best way to fix this. >>> >>> Thanks, >>> >>> Tim > >
Re: [OMPI devel] Multiworld MCA parameter values broken
I'm curious what changed to make this a problem. How were we passing mca param from the base to the app before, and why did it change? I think that options 1 & 2 below are no good, since we, in general, allow string mca params to have spaces (as far as I understand it). So a more general approach is needed. Tim On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote: > Sorry for delay - wasn't ignoring the issue. > > There are several fixes to this problem - ranging in order from least to > most work: > > 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. > It won't affect anything on the backend because the daemon/procs don't use > ssh. > > 2. include "pls_rsh_agent" in the array of mca params not to be passed to > the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the > orte_pls_base_orted_append_basic_args function. This would fix the specific > problem cited here, but I admit that listing every such param by name would > get tedious. > > 3. we could easily detect that a "problem" character was in the mca param > value when we add it to the orted's argv, and then put "" around it. The > problem, however, is that the mca param parser on the far end doesn't > remove those "" from the resulting string. At least, I spent over a day > fighting with a problem only to discover that was happening. Could be an > error in the way I was doing things, or could be a real characteristic of > the parser. Anyway, we would have to ensure that the parser removes any > surrounding "" before passing along the param value or this won't work. > > Ralph > > On 11/5/07 12:10 PM, "Tim Prins" wrote: > > Hi, > > > > Commit 16364 broke things when using multiword mca param values. For > > instance: > > > > mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent > > "ssh -Y" xterm > > > > Will crash and burn, because the value "ssh -Y" is being stored into the > > argv orted_cmd_line in orterun.c:1506. This is then added to the launch > > command for the orted: > > > > /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ; > > export PATH ; > > LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ; > > export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug > > --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename > > odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872 > > --nsreplica > > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 > >:4090 8" > > --gprreplica > > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 > >:4090 8" > > -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca > > mca_base_param_file_path > > /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/ > >examp les > > -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples > > > > Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So > > the quotes have been lost, as we die a horrible death. > > > > So we need to add the quotes back in somehow, or pass these options > > differently. I'm not sure what the best way to fix this. > > > > Thanks, > > > > Tim
Re: [OMPI devel] accessors to context id and message id's
On Nov 6, 2007, at 8:38 AM, Terry Dontje wrote: George Bosilca wrote: If I understand correctly your question, then we don't need any extension. Each request has a unique ID (from PERUSE perspective). However, if I remember correctly this is only half implemented in our PERUSE layer (i.e. it works only for expected requests). Looking at the peruse macros it looks like the unique ID is the base_req address, which I imagine rarely matches between processes. That's a completely different topic. If what you need is a unique ID for each request between processes, in other words, a unique ID for each message, then here is the way to go. Use the same information as the MPI matching logic, i.e. (comm_id, remote, tag) to create an identifier for each message. It will not be unique, as multiple messages can generate the same ID, but you can generate a unique ID per message with easy tricks. The PERUSE standard requires that the ID is unique for each process, and for the lifetime of the request. It does not require that the ID be unique across processes. And this is why we're using the base_req as an ID. george. This should be quite easy to fix, if someone invests a few hours into it. For the context id, a user can always use the c2f function to get the fortran ID (which for Open MPI is the communicator ID). Cool, I didn't realize that. thanks, --td Thanks, george. On Nov 5, 2007, at 8:01 AM, Terry Dontje wrote: Currently in order to do message tracing one either has to rely on some error-prone postprocessing of data or replicating some MPI internals up in the PMPI layer. It would help Sun's tools group (and I believe U Dresden also) if Open MPI would create a couple of APIs that exposed the following: 1. PML Message ids used for a request 2. Context id for a specific communicator I could see a couple of ways of providing this information. Either by extending the PERUSE probes or creating actual functions that one would pass in a request handle or communicator handle to get the appropriate data back. This is just a thought right now, which is why this email is not in an RFC format. I wanted to get a feel from the community as to the interest in such APIs and if anyone may have specific issues with us providing such interfaces. If the responses seem positive I will follow this message up with an RFC. thanks, --td
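As a purely illustrative version of the "easy trick" George describes, a tool could combine the matching tuple with a locally maintained sequence counter. The field widths and the next_sequence() helper are assumptions made for this sketch; MPI_Comm_c2f() is the only real call used, and the note that the Fortran handle equals the context ID is the Open MPI-specific point made above.

    /* Illustrative tool-side sketch: build a per-message ID out of the same
     * tuple the matching logic uses (communicator, remote rank, tag), plus a
     * sequence number kept by the tool so repeats on the same tuple stay
     * distinguishable.  Field widths are arbitrary; next_sequence() is a
     * hypothetical helper, not an existing interface. */

    #include <mpi.h>
    #include <stdint.h>

    extern uint16_t next_sequence(MPI_Fint comm_id, int peer, int tag);

    uint64_t make_message_id(MPI_Comm comm, int peer, int tag)
    {
        /* For Open MPI the Fortran handle is the communicator's context ID. */
        MPI_Fint comm_id = MPI_Comm_c2f(comm);

        uint64_t id = 0;
        id |= (uint64_t)(comm_id & 0xffff) << 48;   /* communicator         */
        id |= (uint64_t)(peer    & 0xffff) << 32;   /* remote (source/dest) */
        id |= (uint64_t)(tag     & 0xffff) << 16;   /* tag                  */
        id |= next_sequence(comm_id, peer, tag);    /* disambiguate repeats */
        return id;
    }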
Re: [OMPI devel] collective problems
This is not a problem in the current code base. Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ran out of credits). Read these two messages again to get the context: http://www.open-mpi.org/community/lists/devel/2007/10/2486.php http://www.open-mpi.org/community/lists/devel/2007/10/2487.php Gleb describes the recursive problem (paired with the concept of NOT_ON_WIRE) nicely in his post. Make sense? On Nov 7, 2007, at 1:16 PM, George Bosilca wrote: On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote: The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, when in the "normal case" it will get called from the first level of the recursion. Or maybe I miss something here ... Right -- it's not the callback that is the problem. It's when the recursion is unwound and further up the stack you now have a stale request. That's exactly the point that I fail to see. If the request is freed in the PML callback, then it should get release in both cases, and therefore lead to problems all the time. Which, obviously, is not true when we do not have this deep recursion thing going on. Moreover, he request management is based on the reference count. The PML level have one ref count and the MPI level have another one. In fact, we cannot release a request until we explicitly call ompi_request_free on it. The place where this call happens is different between the blocking and non blocking calls. In the non blocking case the ompi_request_free get called from the *_test (*_wait) functions while in the blocking case it get called directly from the MPI_Send function. Let me summarize: a request cannot reach a stale state without a call to ompi_request_free. This function is never called directly from the PML level. Therefore, the recursion depth should not have any impact on the state of the request ! Is there a simple test case I can run in order to trigger this strange behavior ? Thanks, george. george. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. 
Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. The problem is that because we allow BTL to call opal_progress() internally t
Re: [OMPI devel] collective problems
On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote: The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, when in the "normal case" it will get called from the first level of the recursion. Or maybe I miss something here ... Right -- it's not the callback that is the problem. It's when the recursion is unwound and further up the stack you now have a stale request. That's exactly the point that I fail to see. If the request is freed in the PML callback, then it should get release in both cases, and therefore lead to problems all the time. Which, obviously, is not true when we do not have this deep recursion thing going on. Moreover, he request management is based on the reference count. The PML level have one ref count and the MPI level have another one. In fact, we cannot release a request until we explicitly call ompi_request_free on it. The place where this call happens is different between the blocking and non blocking calls. In the non blocking case the ompi_request_free get called from the *_test (*_wait) functions while in the blocking case it get called directly from the MPI_Send function. Let me summarize: a request cannot reach a stale state without a call to ompi_request_free. This function is never called directly from the PML level. Therefore, the recursion depth should not have any impact on the state of the request ! Is there a simple test case I can run in order to trigger this strange behavior ? Thanks, george. george. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. 
This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. The problem is that because we allow BTL to call opal_progress() internally the request may be already completed on MPI and MPL levels and freed before return from the call to btl_send(). I did a code review to see how hard it will be to get rid of recursion in Open MPI and I think this is doable. We have to disallow calling progress() (or other functions that may call progress() internally) from BTL and from ULP callbacks that are called by BTL. There is no much places that break this law. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if they can grow without limit and this is the most common use of FREE_LIST_WAIT() so they may be safely changed to FREE_LIST_GET(). After we will solve re
Re: [OMPI devel] collective problems
On Nov 7, 2007, at 12:29 PM, George Bosilca wrote: I finally talked with Galen and Don about this issue in depth. Our understanding is that the "request may get freed before recursion unwinds" issue is *only* a problem within the context of a single MPI call (e.g., MPI_SEND). Is that right? I wonder how this happens ? Specifically, if in an MPI_SEND, the BTL ends up buffering the message and setting early completion, but then recurses into opal_progress() and ends up sending the message and freeing the request during the recursion, then when the recursion unwinds, the original caller will have a stale request. The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, when in the "normal case" it will get called from the first level of the recursion. Or maybe I miss something here ... Right -- it's not the callback that is the problem. It's when the recursion is unwound and further up the stack you now have a stale request. george. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. 
The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. The problem is that because we allow BTL to call opal_progress() internally the request may be already completed on MPI and MPL levels and freed before return from the call to btl_send(). I did a code review to see how hard it will be to get rid of recursion in Open MPI and I think this is doable. We have to disallow calling progress() (or other functions that may call progress() internally) from BTL and from ULP callbacks that are called by BTL. There is no much places that break this law. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if they can grow without limit and this is the most common use of FREE_LIST_WAIT() so they may be safely changed to FREE_LIST_GET(). After we will solve recursion problem the fix to the problem will be a couple of lines of code. - Galen On 10/11/07 11:26 AM, "Gleb Natapov" wrote: On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote: David -- Gleb and I just actively re-looked at this problem yesterday; we think it's related to https://svn.open-mpi.org/trac/ompi/ticket/ 1015. We previously thought this ticket was a different problem, but our analysis yesterday shows that it could be a real problem in the openib BTL or ob1 PML (kinda think it's the openib
Re: [OMPI devel] collective problems
On Nov 7, 2007, at 11:06 AM, Jeff Squyres wrote: Gleb -- I finally talked with Galen and Don about this issue in depth. Our understanding is that the "request may get freed before recursion unwinds" issue is *only* a problem within the context of a single MPI call (e.g., MPI_SEND). Is that right? I wonder how this happens ? Specifically, if in an MPI_SEND, the BTL ends up buffering the message and setting early completion, but then recurses into opal_progress() and ends up sending the message and freeing the request during the recursion, then when the recursion unwinds, the original caller will have a stale request. The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, when in the "normal case" it will get called from the first level of the recursion. Or maybe I miss something here ... george. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. 
The problem is that because we allow BTL to call opal_progress() internally the request may be already completed on MPI and MPL levels and freed before return from the call to btl_send(). I did a code review to see how hard it will be to get rid of recursion in Open MPI and I think this is doable. We have to disallow calling progress() (or other functions that may call progress() internally) from BTL and from ULP callbacks that are called by BTL. There is no much places that break this law. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if they can grow without limit and this is the most common use of FREE_LIST_WAIT() so they may be safely changed to FREE_LIST_GET(). After we will solve recursion problem the fix to the problem will be a couple of lines of code. - Galen On 10/11/07 11:26 AM, "Gleb Natapov" wrote: On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote: David -- Gleb and I just actively re-looked at this problem yesterday; we think it's related to https://svn.open-mpi.org/trac/ompi/ticket/ 1015. We previously thought this ticket was a different problem, but our analysis yesterday shows that it could be a real problem in the openib BTL or ob1 PML (kinda think it's the openib btl because it doesn't seem to happen on other networks, but who knows...). Gleb is investigating. Here is the result of the investigation. The problem is di
[OMPI devel] Incorrect one-sided test
Hi all - Lisa Glendenning, who's working on a Portals one-sided component, discovered that the test onesided/test_start1.c in our repository is incorrect. It assumes that MPI_Win_start is non-blocking, but the standard says that "MPI_WIN_START is allowed to block until the corresponding MPI_WIN_POST calls are executed". The pt2pt and rdma components did not block, so the test error did not show up with those components. I've fixed the test in r1223, but thought I'd let everyone know I changed one of our conformance tests. Brian
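For reference, a minimal sketch of a general active-target exchange that remains correct under the standard's wording; this is not the actual test file, just the pattern the fixed test has to follow.

    /* Minimal sketch of general active-target synchronization that stays
     * correct even when MPI_Win_start blocks until the matching MPI_Win_post
     * has executed.  Not the actual onesided/test_start1.c. */

    #include <mpi.h>

    void exchange(MPI_Win win, MPI_Group peer_group, int peer, int rank, int *buf)
    {
        if (rank == 0) {
            /* Origin: this call is allowed to block until the target posts. */
            MPI_Win_start(peer_group, 0, win);
            MPI_Put(buf, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
            MPI_Win_complete(win);
        } else {
            /* Target: expose the window; nothing on the origin side may be
             * assumed to happen before this post is executed. */
            MPI_Win_post(peer_group, 0, win);
            MPI_Win_wait(win);
        }
    }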
Re: [OMPI devel] v1.2 branch mpi_preconnect_all
Don, Galen, and I talked about this in depth on the phone today and think that it is a symptom of the same issue discussed in this thread: http://www.open-mpi.org/community/lists/devel/2007/10/2382.php Note my message in that thread from just a few minutes ago: http://www.open-mpi.org/community/lists/devel/2007/11/2561.php We think that the proposed solution to that thread will also fix the mpi_preconnect_all issues (i.e., the ping-pong that Don proposes in his mail should not be necessary). On Oct 17, 2007, at 10:54 AM, Don Kerr wrote: All, I have noticed an issue in the 1.2 branch when mpi_preconnect_all=1. The one way communication pattern (ranks either send or receive from each other) may not fully establish connection with peers. Example, if I have a 3 process mpi job and rank 0 does not do any mpi communication after MPI_Init() the other ranks attempts to connect will not be progressed (I have seen this with tcp and udapl). The preconnect pattern has changed slightly in the trunk but essentially it is still a one way communication, either send or receive with each rank. So although the issue I see in the 1.2 branch does not appear in the trunk I wonder if this will show up again. An alternative to the preconnect pattern that comes to mind would be to perform a send and receive between all ranks to ensure that connections have been fully established. Does anyone have thoughts or comments on this, or reasons not to have all ranks send and receive from all? -DON ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems
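For context, the symmetric exchange Don suggests would look something like the sketch below: every rank posts a receive from and a send to every peer, so both directions of each connection get exercised. As noted above, the expectation is that the NOT_ON_WIRE work makes this unnecessary.

    /* Sketch of a symmetric pre-connect exchange: every rank both sends to
     * and receives from every other rank, forcing each connection to be
     * fully established.  Illustrative only; not the actual preconnect code. */

    #include <mpi.h>
    #include <stdlib.h>

    void preconnect_all(MPI_Comm comm)
    {
        int rank, size, peer, n = 0;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        char *sbuf = calloc((size_t)size, 1);
        char *rbuf = calloc((size_t)size, 1);
        MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(*reqs));

        for (peer = 0; peer < size; peer++) {
            if (peer == rank)
                continue;
            MPI_Irecv(&rbuf[peer], 1, MPI_CHAR, peer, 0, comm, &reqs[n++]);
            MPI_Isend(&sbuf[peer], 1, MPI_CHAR, peer, 0, comm, &reqs[n++]);
        }
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);

        free(reqs);
        free(rbuf);
        free(sbuf);
    }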
Re: [OMPI devel] collective problems
Gleb -- I finally talked with Galen and Don about this issue in depth. Our understanding is that the "request may get freed before recursion unwinds" issue is *only* a problem within the context of a single MPI call (e.g., MPI_SEND). Is that right? Specifically, if in an MPI_SEND, the BTL ends up buffering the message and setting early completion, but then recurses into opal_progress() and ends up sending the message and freeing the request during the recursion, then when the recursion unwinds, the original caller will have a stale request. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. The problem is that because we allow BTL to call opal_progress() internally the request may be already completed on MPI and MPL levels and freed before return from the call to btl_send(). I did a code review to see how hard it will be to get rid of recursion in Open MPI and I think this is doable. 
We have to disallow calling progress() (or other functions that may call progress() internally) from BTL and from ULP callbacks that are called by BTL. There is no much places that break this law. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if they can grow without limit and this is the most common use of FREE_LIST_WAIT() so they may be safely changed to FREE_LIST_GET(). After we will solve recursion problem the fix to the problem will be a couple of lines of code. - Galen On 10/11/07 11:26 AM, "Gleb Natapov" wrote: On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote: David -- Gleb and I just actively re-looked at this problem yesterday; we think it's related to https://svn.open-mpi.org/trac/ompi/ticket/ 1015. We previously thought this ticket was a different problem, but our analysis yesterday shows that it could be a real problem in the openib BTL or ob1 PML (kinda think it's the openib btl because it doesn't seem to happen on other networks, but who knows...). Gleb is investigating. Here is the result of the investigation. The problem is different than #1015 ticket. What we have here is one rank calls isend() of a small message and wait_all() in a loop and another one calls irecv(). The problem is that isend() usually doesn't call opal_progress() anywhere and wait_all() doesn't call progress if all requests are already c
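A bare-bones illustration of the flag scheme in steps 2-5 might look like the following. Every identifier here is made up for the sketch, and the malloc/free lifetime is a simplification of the real reference-counted requests; it only shows the intended control flow.

    /* Hypothetical illustration of the DONT_FREE_ME proposal (steps 2-5).
     * Names and the malloc/free lifetime are simplifications; the real PML
     * requests are reference-counted objects. */

    #include <stdbool.h>
    #include <stdlib.h>

    struct pml_request {
        bool dont_free_me;   /* step 3: set by blocking MPI_SEND/MPI_RECV  */
        bool completed;      /* set when the BTL reports local completion  */
    };

    /* Step 4: the completion path checks the flag before freeing. */
    void pml_completion_callback(struct pml_request *req)
    {
        req->completed = true;
        if (!req->dont_free_me)
            free(req);                 /* non-blocking path: free as today */
    }

    /* Steps 3 and 5a: a blocking send owns the request until it returns, so
     * a recursive trip through opal_progress() can no longer hand the caller
     * back a stale request. */
    int pml_blocking_send(void)
    {
        struct pml_request *req = calloc(1, sizeof(*req));
        req->dont_free_me = true;      /* set before any btl_send() call */

        /* ... btl_send(), then progress until req->completed ... */

        free(req);                     /* step 5a: unconditionally freed */
        return 0;
    }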
Re: [OMPI devel] Multiworld MCA parameter values broken
Sorry for delay - wasn't ignoring the issue. There are several fixes to this problem - ranging in order from least to most work: 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. It won't affect anything on the backend because the daemon/procs don't use ssh. 2. include "pls_rsh_agent" in the array of mca params not to be passed to the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the orte_pls_base_orted_append_basic_args function. This would fix the specific problem cited here, but I admit that listing every such param by name would get tedious. 3. we could easily detect that a "problem" character was in the mca param value when we add it to the orted's argv, and then put "" around it. The problem, however, is that the mca param parser on the far end doesn't remove those "" from the resulting string. At least, I spent over a day fighting with a problem only to discover that was happening. Could be an error in the way I was doing things, or could be a real characteristic of the parser. Anyway, we would have to ensure that the parser removes any surrounding "" before passing along the param value or this won't work. Ralph On 11/5/07 12:10 PM, "Tim Prins" wrote: > Hi, > > Commit 16364 broke things when using multiword mca param values. For > instance: > > mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent > "ssh -Y" xterm > > Will crash and burn, because the value "ssh -Y" is being stored into the > argv orted_cmd_line in orterun.c:1506. This is then added to the launch > command for the orted: > > /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ; > export PATH ; > LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ; > export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug > --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename > odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872 > --nsreplica > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0:4090 > 8" > --gprreplica > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0:4090 > 8" > -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca > mca_base_param_file_path > /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/examp > les > -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples > > Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So > the quotes have been lost, as we die a horrible death. > > So we need to add the quotes back in somehow, or pass these options > differently. I'm not sure what the best way to fix this. > > Thanks, > > Tim
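Option 3 boils down to two small pieces, sketched here with placeholder helper names (this is not the actual orterun or mca_base parser code): quote a value that contains whitespace when appending it to the orted command line, and strip one matching pair of quotes on the receiving side before the value is stored.

    /* Sketch of option 3 with made-up helper names.  Launcher side: quote a
     * value only when it needs it; daemon side: strip one surrounding pair
     * of quotes before the value is stored. */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    char *quote_mca_value(const char *value)
    {
        char *out;
        if (strpbrk(value, " \t") == NULL)
            return strdup(value);             /* common case: nothing to do */
        out = malloc(strlen(value) + 3);
        sprintf(out, "\"%s\"", value);        /* ssh -Y  becomes  "ssh -Y" */
        return out;
    }

    char *unquote_mca_value(const char *value)
    {
        size_t len = strlen(value);
        if (len >= 2 && value[0] == '"' && value[len - 1] == '"') {
            char *out = malloc(len - 1);
            memcpy(out, value + 1, len - 2);
            out[len - 2] = '\0';
            return out;
        }
        return strdup(value);
    }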
[OMPI devel] carto framework requirements
Hi, I wrote some software requirements for the carto framework. Please review this and post comments. Thanks. Sharon. (Attachment: carto_framework_requirements.pdf)