Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16691

2007-11-08 Thread Adrian Knoth
On Thu, Nov 08, 2007 at 08:02:09AM -0500, Jeff Squyres wrote:

> >> All ROMIO patches *must* be coordinated with the ROMIO maintainers.
> > Upstream? That's the upstream patch.
> That was extracted from ROMIO itself?  Which release?

From Jiri:


The patch was extracted from the ROMIO sources that come with MPICH2 1.0.6.

As noted on the ROMIO web page (http://www-unix.mcs.anl.gov/romio/):

"Note: The version of ROMIO described on this page is an old one. We
haven't released newer versions of ROMIO as independent packages for a
while; they were included as part of MPICH2 and MPICH-1. You can get the
latest version of ROMIO when you download MPICH2 or MPICH-1."


--- end of Jiri ---

> > Jiri Polach has extracted the fix for this problem. Updating OMPI to a
> > newer ROMIO version should do the trick, so we might want to revert
> > r16693 and r16691.
> It would be great to upgrade to a newer version of ROMIO.  Do you have  
> the cycles to do it?

Let's see ;) If life is going to be boring, I'll have a look at it ;)

> If this is slated for v1.3, then I think it would be much better to  
> back out that patch and then do a real upgrade.

ACK.


-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


Re: [OMPI devel] collective problems

2007-11-08 Thread Patrick Geoffray

Hi Gleb,

Gleb Natapov wrote:

In the case of TCP, the kernel is kind enough to progress the message for you,
but only if there was enough space in the kernel's internal buffers. If there
was no space there, the TCP BTL will also buffer messages in userspace and
will, eventually, have the same problem.


Occasionally buffering to hide a flow-control issue is fine, assuming that
there is a mechanism to flush the buffer (below). However, you cannot
buffer everything, and it is just as fine to expose the back pressure
when the buffer space is exhausted, to show the application that there
is a sustained problem. In this case, it is reasonable to block the
application (i.e., the MPI request) while you cannot buffer the outgoing data.


The progression of already-buffered outgoing data is the real problem,
not the buffering itself.


Here, the proposal is to allow the BTL to buffer, but to require the PML
to handle progress. That's broken, IMHO.



To progress such outstanding messages, an additional thread is needed in
userspace. Is this what MX does?


MX uses a user-level thread, but it's mainly for progressing the
higher-level protocol on the receive side. On the send side, for the
low-level protocol, it is easier to ask your driver either to wake you
up when the sending resource is available again (blocking on a CQ for
IB) or to take care of the sending itself.



My overall problem with this proposal is a race to the bottom, based on
the lowest BTL, functionality-wise. The PML already imposes pipelining
for large messages (with a few knobs, but still) when most protocols in
other BTLs already have their own. Now it's flow-control progression
(not MPI progression).


Can each BTL implement what is needed for a particular back-end instead
of bloating the upper layer?
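
The pattern being argued for here looks roughly like the following self-contained sketch (toy names and a toy credit counter, not any actual BTL code): the send path absorbs back pressure into an internal queue, the BTL's own progress function drains that queue later, and only when the queue is truly full does the caller ever see the problem.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_PENDING 16

    typedef struct { const char *data; } send_req_t;

    static send_req_t pending[MAX_PENDING];   /* BTL-internal pending queue   */
    static int pending_count = 0;             /* (LIFO here for brevity)      */
    static int credits = 1;                   /* pretend flow-control credits */

    /* Try to put a message on the wire; if flow control says no, queue it
     * inside the BTL instead of pushing the problem up to the PML. */
    static bool toy_btl_send(const char *data)
    {
        if (credits > 0) {
            credits--;
            printf("sent immediately: %s\n", data);
            return true;
        }
        if (pending_count < MAX_PENDING) {
            pending[pending_count++].data = data; /* back pressure absorbed here */
            return true;                          /* upper layer sees "accepted" */
        }
        return false;     /* buffer exhausted: only now expose the back pressure */
    }

    /* Called from the BTL's own progress function: flush queued sends as
     * credits come back, with no help from the upper layer. */
    static void toy_btl_progress(void)
    {
        credits++;                                /* e.g. an ACK returned a credit */
        while (pending_count > 0 && credits > 0) {
            credits--;
            printf("sent from pending queue: %s\n", pending[--pending_count].data);
        }
    }

    int main(void)
    {
        toy_btl_send("first message");            /* goes straight out */
        toy_btl_send("second message");           /* no credits left: buffered */
        toy_btl_progress();                       /* later: flushed by the BTL itself */
        return 0;
    }
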



Patrick


Re: [OMPI devel] Multiworld MCA parameter values broken

2007-11-08 Thread Ralph H Castain
Might I suggest:

https://svn.open-mpi.org/trac/ompi/ticket/1073

It deals with some of these issues and explains the boundaries of the
problem. As for what a string param can contain, I have no opinion. I only
note that it must handle special characters such as ';', '/', etc. that are
typically found in URIs. I cannot think of any reason it should have a
quote in it.

Ralph



On 11/8/07 12:25 PM, "Tim Prins"  wrote:

> The alias option you presented does not work. I think we do some weird
> things to find the absolute path for ssh, instead of just issuing the
> command.
> 
> I would spend some time fixing this, but I don't want to do it wrong. We
> could quote all the param values, and change the parser to remove the
> quotes, but this is assuming that a mca param does not contain quotes.
> 
> So I guess there are 2 questions that need to be answered before a fix
> is made:
> 
> 1. What exactly can a string mca param contain? Can it have quotes or
> spaces or?
> 
> 2. Which mca parameters should be forwarded? Should it be just the ones
> from the command line? From the environment? From config files?
> 
> Tim
> 
> Ralph Castain wrote:
>> What changed is that we never passed mca params to the orted before - they
>> always went to the app, but it's the orted that has the issue. There is a
>> bug ticket thread on this subject - I forget the number immediately.
>> 
>> Basically, the problem was that we cannot generally pass the local
>> environment to the orteds when we launch them. However, people needed
>> various mca params to get to the orteds to control their behavior. The only
>> way to resolve that problem was to pass the params via the command line,
>> which is what was done.
>> 
>> Except for a very few cases, all of our mca params are single values that do
>> not include spaces, so this is not a problem that is causing widespread
>> issues. As I said, I already had to deal with one special case that didn't
>> involve spaces, but did have special characters that required quoting, which
>> identified the larger problem of dealing with quoted strings.
>> 
>> I have no objection to a more general fix. Like I said in my note, though,
>> the general fix will take a larger effort. If someone is willing to do so,
>> that is fine with me - I was only offering solutions that would fill the
>> interim time as I haven't heard anyone step up to say they would fix it
>> anytime soon.
>> 
>> Please feel free to jump in and volunteer! ;-) I'm willing to put the quotes
>> around things if you will fix the mca cmd line parser to cleanly remove them
>> on the other end.
>> 
>> Ralph
>> 
>> 
>> 
>> On 11/7/07 5:50 PM, "Tim Prins"  wrote:
>> 
>>> I'm curious what changed to make this a problem. How were we passing mca
>>> param
>>> from the base to the app before, and why did it change?
>>> 
>>> I think that options 1 & 2 below are no good, since we, in general, allow
>>> string mca params to have spaces (as far as I understand it). So a more
>>> general approach is needed.
>>> 
>>> Tim
>>> 
>>> On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote:
 Sorry for delay - wasn't ignoring the issue.
 
 There are several fixes to this problem - ranging in order from least to
 most work:
 
 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param.
 It won't affect anything on the backend because the daemon/procs don't use
 ssh.
 
 2. include "pls_rsh_agent" in the array of mca params not to be passed to
 the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the
 orte_pls_base_orted_append_basic_args function. This would fix the specific
 problem cited here, but I admit that listing every such param by name would
 get tedious.
 
 3. we could easily detect that a "problem" character was in the mca param
 value when we add it to the orted's argv, and then put "" around it. The
 problem, however, is that the mca param parser on the far end doesn't
 remove those "" from the resulting string. At least, I spent over a day
 fighting with a problem only to discover that was happening. Could be an
 error in the way I was doing things, or could be a real characteristic of
 the parser. Anyway, we would have to ensure that the parser removes any
 surrounding "" before passing along the param value or this won't work.
 
 Ralph
 
 On 11/5/07 12:10 PM, "Tim Prins"  wrote:
> Hi,
> 
> Commit 16364 broke things when using multiword mca param values. For
> instance:
> 
> mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent
> "ssh -Y" xterm
> 
> Will crash and burn, because the value "ssh -Y" is being stored into the
> argv orted_cmd_line in orterun.c:1506. This is then added to the launch
> command for the orted:
> 
> /usr/bin/ssh -Y odin004  PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ;
> export PATH ;
> LD_LIBRARY_PATH=/san

Re: [OMPI devel] Multiworld MCA parameter values broken

2007-11-08 Thread Tim Prins
The alias option you presented does not work. I think we do some weird 
things to find the absolute path for ssh, instead of just issuing the 
command.


I would spend some time fixing this, but I don't want to do it wrong. We 
could quote all the param values, and change the parser to remove the 
quotes, but this is assuming that a mca param does not contain quotes.
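
For what it's worth, the quote-and-strip approach is mechanically simple. Below is a minimal, self-contained sketch of the idea (hypothetical helper names, not the actual MCA command-line or parser code), which also makes the caveat explicit: it only works if the value itself never contains a quote.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Wrap the value in double quotes if it contains characters that would
     * otherwise be split or mangled on the orted command line.
     * The caller frees the returned string. */
    static char *quote_if_needed(const char *value)
    {
        char *out;
        if (strpbrk(value, " \t;") == NULL) {
            return strdup(value);
        }
        out = malloc(strlen(value) + 3);          /* two quotes plus NUL */
        sprintf(out, "\"%s\"", value);
        return out;
    }

    /* On the far end, strip one layer of surrounding double quotes before
     * storing the param value. */
    static char *strip_quotes(const char *value)
    {
        size_t len = strlen(value);
        if (len >= 2 && value[0] == '"' && value[len - 1] == '"') {
            char *out = strdup(value + 1);
            out[len - 2] = '\0';
            return out;
        }
        return strdup(value);
    }

    int main(void)
    {
        char *quoted = quote_if_needed("ssh -Y");
        char *stored = strip_quotes(quoted);
        printf("on the command line: %s\n", quoted);  /* "ssh -Y" */
        printf("stored param value:  %s\n", stored);  /* ssh -Y   */
        free(quoted);
        free(stored);
        return 0;
    }

A real fix would also have to decide how to escape embedded quotes, which is exactly the open question above.
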


So I guess there are 2 questions that need to be answered before a fix 
is made:


1. What exactly can a string mca param contain? Can it have quotes or 
spaces or?


2. Which mca parameters should be forwarded? Should it be just the ones 
from the command line? From the environment? From config files?


Tim

Ralph Castain wrote:

What changed is that we never passed mca params to the orted before - they
always went to the app, but it's the orted that has the issue. There is a
bug ticket thread on this subject - I forget the number immediately.

Basically, the problem was that we cannot generally pass the local
environment to the orteds when we launch them. However, people needed
various mca params to get to the orteds to control their behavior. The only
way to resolve that problem was to pass the params via the command line,
which is what was done.

Except for a very few cases, all of our mca params are single values that do
not include spaces, so this is not a problem that is causing widespread
issues. As I said, I already had to deal with one special case that didn't
involve spaces, but did have special characters that required quoting, which
identified the larger problem of dealing with quoted strings.

I have no objection to a more general fix. Like I said in my note, though,
the general fix will take a larger effort. If someone is willing to do so,
that is fine with me - I was only offering solutions that would fill the
interim time as I haven't heard anyone step up to say they would fix it
anytime soon.

Please feel free to jump in and volunteer! ;-) I'm willing to put the quotes
around things if you will fix the mca cmd line parser to cleanly remove them
on the other end.

Ralph



On 11/7/07 5:50 PM, "Tim Prins"  wrote:


I'm curious what changed to make this a problem. How were we passing mca param
from the base to the app before, and why did it change?

I think that options 1 & 2 below are no good, since we, in general, allow
string mca params to have spaces (as far as I understand it). So a more
general approach is needed.

Tim

On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote:

Sorry for delay - wasn't ignoring the issue.

There are several fixes to this problem - ranging in order from least to
most work:

1. just alias "ssh" to be "ssh -Y" and run without setting the mca param.
It won't affect anything on the backend because the daemon/procs don't use
ssh.

2. include "pls_rsh_agent" in the array of mca params not to be passed to
the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the
orte_pls_base_orted_append_basic_args function. This would fix the specific
problem cited here, but I admit that listing every such param by name would
get tedious.

3. we could easily detect that a "problem" character was in the mca param
value when we add it to the orted's argv, and then put "" around it. The
problem, however, is that the mca param parser on the far end doesn't
remove those "" from the resulting string. At least, I spent over a day
fighting with a problem only to discover that was happening. Could be an
error in the way I was doing things, or could be a real characteristic of
the parser. Anyway, we would have to ensure that the parser removes any
surrounding "" before passing along the param value or this won't work.

Ralph

On 11/5/07 12:10 PM, "Tim Prins"  wrote:

Hi,

Commit 16364 broke things when using multiword mca param values. For
instance:

mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent
"ssh -Y" xterm

Will crash and burn, because the value "ssh -Y" is being stored into the
argv orted_cmd_line in orterun.c:1506. This is then added to the launch
command for the orted:

/usr/bin/ssh -Y odin004  PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ;
export PATH ;
LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ;
export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug
--debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename
odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872
--nsreplica
"0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0
:4090 8"
--gprreplica
"0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0
:4090 8"
-mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca
mca_base_param_file_path
/u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/
examp les
-mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples

Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So
the quotes have been lost, and we die a horrible death.

So we nee

Re: [OMPI devel] Moving fragments in btl sm

2007-11-08 Thread George Bosilca
The real memory copy happens in the convertor, more specifically in
ompi_convertor_pack for the sender and in ompi_convertor_unpack for
the receiver. In fact, none of the BTLs directly call memcpy; all
memory movement is done via the convertor.
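
To make that division of labor concrete, here is a deliberately simplified, self-contained sketch of the idea (toy types and names, not the real ompi_convertor interface): the BTL only moves fragments around, while the "convertor" is the piece that actually touches user memory.

    #include <stdio.h>
    #include <string.h>

    /* Toy "convertor": tracks how far into the user buffer we are.  The real
     * convertor also handles datatype layouts and heterogeneous conversion. */
    typedef struct {
        const char *send_buf;   /* sender's user buffer */
        char       *recv_buf;   /* receiver's user buffer */
        size_t      size;
        size_t      offset;
    } toy_convertor_t;

    /* Sender side: copy the next chunk of user data into the fragment
     * payload.  This is where the memcpy lives -- not in the BTL. */
    static size_t toy_pack(toy_convertor_t *c, void *payload, size_t max)
    {
        size_t n = c->size - c->offset;
        if (n > max) n = max;
        memcpy(payload, c->send_buf + c->offset, n);
        c->offset += n;
        return n;
    }

    /* Receiver side: copy from the fragment payload into the user buffer. */
    static void toy_unpack(toy_convertor_t *c, const void *payload, size_t n)
    {
        memcpy(c->recv_buf + c->offset, payload, n);
        c->offset += n;
    }

    int main(void)
    {
        char msg[] = "hello, shared memory";
        char dst[sizeof(msg)];
        char frag[8];                                    /* small fragment payload */
        toy_convertor_t snd = { msg, NULL, sizeof(msg), 0 };
        toy_convertor_t rcv = { NULL, dst, sizeof(msg), 0 };

        while (snd.offset < snd.size) {                  /* BTL loop: move fragments */
            size_t n = toy_pack(&snd, frag, sizeof(frag));
            toy_unpack(&rcv, frag, n);                   /* "arrival" of the fragment */
        }
        printf("%s\n", dst);
        return 0;
    }
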


  george.

On Nov 8, 2007, at 7:38 AM, Torje Henriksen wrote:


Hi,

I have a question that I shouldn't need to ask, but I'm
kind of lost in the code.

The btl sm component is using the circular buffers to write and read
fragments (sending and receiving).

In the write_to_head and read_from_tail I can only see pointers being set,
no data being moved. So where does the actual data movement/copying take
place? I'm thinking maybe a callback function exists somewhere :)


Thank you for your help now and earlier.


Best regards,

Torje Henriksen
(tor...@stud.cs.uit.no)







Re: [OMPI devel] Moving fragments in btl sm

2007-11-08 Thread Li-Ta Lo
On Thu, 2007-11-08 at 13:38 +0100, Torje Henriksen wrote:
> Hi,
> 
> I have a question that I shouldn't need to ask, but I'm 
> kind of lost in the code.
> 
> The btl sm component is using the circular buffers to write and read 
> fragments (sending and receiving).
> 
> In the write_to_head and read_from_tail I can only see pointers being set, 
> no data being moved. So where does the actual data movement/copying take 
> place? I'm thinking maybe a callback function exists somewhere :)
> 
> 
> Thank you for your help now and earlier.
> 

You are right. The "real thing" happens in
mca_btl_sm_component_progress(). The PML/BML will call btl_register()
to register a callback function to be called when a frag is received.
In the event loop, the progress() function is called periodically to
check whether any new frag has arrived. It is complicated a little
bit by the fact that to transmit each "data" frag, there is a round
trip and two "frags" are exchanged. The send side sends the "data"
frag with header type SEND to the receiver. The receiver calls the
callback function to handle the frag and sends back an ACK frag. Upon
receiving the ACK frag, the send side calls des_cbfunc() to tell
the upper layer that the sending of this frag is completed.
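
Concretely, that dispatch boils down to something like the following self-contained sketch (toy structures, not the actual mca_btl_sm_component_progress code):

    #include <stdio.h>

    enum frag_type { FRAG_SEND, FRAG_ACK };

    typedef struct frag {
        enum frag_type type;
        void (*des_cbfunc)(struct frag *f);   /* sender-side completion callback */
        const char *payload;
    } frag_t;

    /* Callback the PML registered via btl_register(): consume the data
     * (and, in the real code, queue an ACK frag back to the sender). */
    static void recv_cbfunc(frag_t *f)
    {
        printf("received: %s\n", f->payload);
    }

    static void send_complete(frag_t *f)
    {
        printf("send of '%s' is complete\n", f->payload);
    }

    /* One pass of the progress loop: drain whatever frags have arrived in
     * the circular buffer and dispatch on the header type. */
    static void toy_progress(frag_t **inbox, int n)
    {
        for (int i = 0; i < n; i++) {
            if (inbox[i]->type == FRAG_SEND) {
                recv_cbfunc(inbox[i]);            /* data frag: hand to upper layer */
            } else {
                inbox[i]->des_cbfunc(inbox[i]);   /* ACK frag: sender is done */
            }
        }
    }

    int main(void)
    {
        frag_t data = { FRAG_SEND, NULL, "a data fragment" };
        frag_t ack  = { FRAG_ACK, send_complete, "a data fragment" };
        frag_t *inbox[] = { &data, &ack };
        toy_progress(inbox, 2);
        return 0;
    }
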

BTW, it looks like it is still list append/remove in the PML/BML layer.
I don't know when/where the real "copying" happens.

Ollie




Re: [OMPI devel] collective problems

2007-11-08 Thread George Bosilca
Decreasing the latency is the main reason. If we delay MPI
completion, then we always have to call opal_progress at least once in
order to allow the BTL to trigger the callback. In the current
implementation, we never call opal_progress on small messages, unless
there is some kind of resource starvation.


  Thanks,
george.

On Nov 8, 2007, at 11:09 AM, Andrew Friedley wrote:


Brian Barrett wrote:

Personally, I'd rather just not mark MPI completion until a local
completion callback from the BTL.  But others don't like that idea, so
we came up with a way for back pressure from the BTL to say "it's not
on the wire yet".  This is more complicated than just not marking MPI
completion early, but why would we do something that helps real apps
at the expense of benchmarks?  That would just be silly!


FWIW this issue is also very relevant for the UD BTL, especially with
some new work I've done in the last week (currently having problems with
send-side completion semantics).  I missed it, what was the reasoning
for not marking MPI completion until a callback from the BTL?

Andrew






Re: [OMPI devel] collective problems

2007-11-08 Thread Andrew Friedley



Brian Barrett wrote:
 > Personally, I'd rather just not mark MPI completion until a local
completion callback from the BTL.  But others don't like that idea, so  
we came up with a way for back pressure from the BTL to say "it's not  
on the wire yet".  This is more complicated than just not marking MPI  
completion early, but why would we do something that helps real apps  
at the expense of benchmarks?  That would just be silly!


FWIW this issue is also very relevant for the UD BTL, especially with 
some new work I've done in the last week (currently having problems with 
send-side completion semantics).  I missed it, what was the reasoning 
for not marking MPI completion until a callback from the BTL?


Andrew


Re: [OMPI devel] Release wiki pages

2007-11-08 Thread Tim Mattox
Thanks Jeff!

On Nov 8, 2007 9:07 AM, Jeff Squyres  wrote:
> I literally just discovered that the trac "milestone" pages can be
> edited.
>
> This seems like a much better place to put the 1.1, 1.2, and 1.3
> release series wiki pages.  So I moved all the content and updated the
> links on the front wiki page.  Each "1.x" milestone is now a top-level
> view of the entire series with links to individual milestone pages for
> details about that specific release.  For example:
>
>  https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.2
>  https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3
>
> Release Managers / Gatekeepers:
> - the pages are editable just like other wiki pages, so nothing
> changes there
> - when you edit the milestone page, I note that there's a handy
> "Retarget associated open tickets to milestone [X]" option for moving
> all leftover tickets to the next milestone
>
> --
> Jeff Squyres
> Cisco Systems
>



-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/


Re: [OMPI devel] collective problems

2007-11-08 Thread Richard Graham



On 11/8/07 4:03 AM, "Gleb Natapov"  wrote:

> On Wed, Nov 07, 2007 at 11:25:43PM -0500, Patrick Geoffray wrote:
>> Richard Graham wrote:
>>> The real problem, as you and others have pointed out, is the lack of
>>> predictable time slices for the progress engine to do its work, when relying
>>> on the ULP to make calls into the library...
>> 
>> The real, real problem is that the BTL should handle progression at
>> their level, especially when the buffering is due to BTL-level flow
>> control. When I write something into a socket, TCP will take care of
>> sending it eventually, for example.
> In the case of TCP, the kernel is kind enough to progress the message for you,
> but only if there was enough space in the kernel's internal buffers. If there
> was no space there, the TCP BTL will also buffer messages in userspace and
> will, eventually, have the same problem.
> 
> To progress such outstanding messages, an additional thread is needed in
> userspace. Is this what MX does?

Yes - this is the bottom line; the current problem is the high cost of
scheduling such threads at some sort of reasonable frequency.
Rich

> 
> --
> Gleb.



[OMPI devel] Release wiki pages

2007-11-08 Thread Jeff Squyres
I literally just discovered that the trac "milestone" pages can be  
edited.


This seems like a much better place to put the 1.1, 1.2, and 1.3  
release series wiki pages.  So I moved all the content and updated the  
links on the front wiki page.  Each "1.x" milestone is now a top-level  
view of the entire series with links to individual milestone pages for  
details about that specific release.  For example:


https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.2
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3

Release Managers / Gatekeepers:
- the pages are editable just like other wiki pages, so nothing  
changes there
- when you edit the milestone page, I note that there's a handy  
"Retarget associated open tickets to milestone [X]" option for moving  
all leftover tickets to the next milestone


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] accessors to context id and message id's

2007-11-08 Thread Terry Dontje

George Bosilca wrote:


On Nov 6, 2007, at 8:38 AM, Terry Dontje wrote:


George Bosilca wrote:

If I understand your question correctly, then we don't need any
extension. Each request has a unique ID (from the PERUSE perspective).
However, if I remember correctly, this is only half implemented in our
PERUSE layer (i.e., it works only for expected requests).

Looking at the PERUSE macros, it looks like the unique ID is the
base_req address, which I imagine rarely matches between processes.


That's a completely different topic. If what you need is a unique ID
for each request between processes, in other words a unique ID for
each message, then here is the way to go. Use the same information as
the MPI matching logic, i.e. (comm_id, remote, tag), to create an
identifier for each message. It will not be unique, as multiple
messages can generate the same ID, but you can generate a unique ID
per message with easy tricks.
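
One such easy trick (a hedged sketch; the field widths and names are made up, and this is not anything Open MPI actually exposes) is to fold a per-(communicator, peer) sequence number in with the matching triple, so two messages that match identically still get distinct IDs:

    #include <stdint.h>
    #include <stdio.h>

    /* Build a 64-bit message ID from the MPI matching information plus a
     * monotonically increasing sequence number kept per (comm, peer).
     * The field widths are arbitrary for this sketch. */
    static uint64_t make_msg_id(uint16_t comm_id, uint16_t remote_rank,
                                uint16_t tag, uint16_t seq)
    {
        return ((uint64_t)comm_id     << 48) |
               ((uint64_t)remote_rank << 32) |
               ((uint64_t)tag         << 16) |
                (uint64_t)seq;
    }

    int main(void)
    {
        /* Two sends with identical matching info still get distinct IDs. */
        uint64_t id1 = make_msg_id(0, 3, 42, 0);
        uint64_t id2 = make_msg_id(0, 3, 42, 1);
        printf("%016llx\n%016llx\n",
               (unsigned long long)id1, (unsigned long long)id2);
        return 0;
    }
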


I understand that one could try to rely on the order of the messages
being sent and received; however, this only works if you ultimately
capture every message, which is something I would like to avoid.  My
hope was to use something already embedded in the library rather than
having to add more crap on top of the library.  This seems like something that
would be useful to any tracing utility (like Vampir).  However, I
imagine the argument against such a thing is that not all MPI libraries
would support such an ID, thus making this a one-off.
The PERUSE standard requires that the ID be unique within each process
and for the lifetime of the request. It does not require that the ID
be unique across processes, and this is why we're using the base_req
address as an ID.


I understand that the PERUSE spec did not define the ID to be unique
across processes, which is why I was surprised by your answer.  Score one
for miscommunication.  It would have been nice if the PERUSE committee
had provided an option for an implementation to expose message IDs.


--td

  george.





This should be quite easy to fix, if someone invests a few hours into it.

For the context ID, a user can always use the c2f function to get the
Fortran ID (which for Open MPI is the communicator ID).


Cool, I didn't realize that.

thanks,

--td

 Thanks,
   george.

On Nov 5, 2007, at 8:01 AM, Terry Dontje wrote:

Currently, in order to do message tracing, one either has to rely on some
error-prone post-processing of data or replicate some MPI internals up
in the PMPI layer.  It would help Sun's tools group (and I believe U
Dresden also) if Open MPI would create a couple of APIs that exposed the
following:

1. PML message IDs used for a request
2. Context ID for a specific communicator

I could see a couple of ways of providing this information: either by
extending the PERUSE probes or by creating actual functions to which one
would pass a request handle or communicator handle to get the appropriate
data back.

This is just a thought right now, which is why this email is not in an RFC
format.  I wanted to get a feel from the community as to the interest in
such APIs and whether anyone may have specific issues with us providing
such interfaces.  If the responses seem positive, I will follow this
message up with an RFC.

thanks,

--td




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16691

2007-11-08 Thread Jeff Squyres

On Nov 8, 2007, at 7:57 AM, Adrian Knoth wrote:


All ROMIO patches *must* be coordinated with the ROMIO maintainers.


Upstream? That's the upstream patch.


That was extracted from ROMIO itself?  Which release?


Jiri Polach has extracted the fix for this problem. Updating OMPI to a
newer ROMIO version should do the trick, so we might want to revert
r16693 and r16691.


It would be great to upgrade to a newer version of ROMIO.  Do you have  
the cycles to do it?


If this is slated for v1.3, then I think it would be much better to
back out that patch and then do a real upgrade.  I have a few ideas
about making the integration easier (e.g., forget the whole idea of
renaming files -- it was a good idea, but a) it has turned out not to
be necessary in practice [even though theoretically it's still the
Right Thing to Do], and b) it's a giant PITA for continual
integration), and Rob Latham has indicated that he was going to put
ROMIO in its own SVN, which might make one or two of the integration
issues easier (but we're certainly not going to grab random
snapshots :-) ).  There was a short mail thread about this a while ago
on this list.


I'd be happy to point someone in the right direction for ROMIO  
maintenance, but I do not have the cycles to do this at the moment.   
Probably not until the January/February timeframe, unfortunately...


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16691

2007-11-08 Thread Adrian Knoth
On Thu, Nov 08, 2007 at 07:51:28AM -0500, Jeff Squyres wrote:

[r16691]
> Whoa; I'm not sure we want to apply this.

Me neither.

> All ROMIO patches *must* be coordinated with the ROMIO maintainers.   

Upstream? That's the upstream patch.

Jiri Polach has extracted the fix for this problem. Updating OMPI to a
newer ROMIO version should do the trick, so we might want to revert
r16693 and r16691.

You decide.

-- 
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de


Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16691

2007-11-08 Thread Jeff Squyres

Whoa; I'm not sure we want to apply this.

All ROMIO patches *must* be coordinated with the ROMIO maintainers.
Otherwise this becomes a complete nightmare of logistics.  There are
already a few other ROMIO patches that we have consciously chosen not
to apply because of the tangled issues that arise, such as:


- "what version of ROMIO is in OMPI?"
- "do you have patch X?"
- ...etc.

Hence, it is best to coordinate all ROMIO patches with the upstream  
ROMIO maintainers.




On Nov 8, 2007, at 7:44 AM, a...@osl.iu.edu wrote:


Author: adi
Date: 2007-11-08 07:44:10 EST (Thu, 08 Nov 2007)
New Revision: 16691
URL: https://svn.open-mpi.org/trac/ompi/changeset/16691

Log:
upstream patch, provided by Jiri Polach. Re #733

Text files modified:
  trunk/ompi/mca/io/romio/romio/adio/common/ad_open.c |    32 ++++
  1 files changed, 32 insertions(+), 0 deletions(-)

Modified: trunk/ompi/mca/io/romio/romio/adio/common/ad_open.c
==============================================================================
--- trunk/ompi/mca/io/romio/romio/adio/common/ad_open.c	(original)
+++ trunk/ompi/mca/io/romio/romio/adio/common/ad_open.c	2007-11-08 07:44:10 EST (Thu, 08 Nov 2007)
@@ -172,6 +172,37 @@
      */
     /* pvfs2 handles opens specially, so it is actually more efficent for that
      * file system if we skip this optimization */
+    /* NFS handles opens especially poorly, so we cannot use this optimization
+     * on that FS */
+    if (fd->file_system == ADIO_NFS) {
+        /* no optimizations for NFS: */
+        if ((access_mode & ADIO_CREATE) && (access_mode & ADIO_EXCL)) {
+            /* the open should fail if the file exists. Only *1* process should
+               check this. Otherwise, if all processes try to check and the file
+               does not exist, one process will create the file and others who
+               reach later will return error. */
+            if (rank == fd->hints->ranklist[0]) {
+                fd->access_mode = access_mode;
+                (*(fd->fns->ADIOI_xxx_Open))(fd, error_code);
+                MPI_Bcast(error_code, 1, MPI_INT,
+                          fd->hints->ranklist[0], fd->comm);
+                /* if no error, close the file and reopen normally below */
+                if (*error_code == MPI_SUCCESS)
+                    (*(fd->fns->ADIOI_xxx_Close))(fd, error_code);
+            }
+            else MPI_Bcast(error_code, 1, MPI_INT,
+                           fd->hints->ranklist[0], fd->comm);
+            if (*error_code != MPI_SUCCESS) {
+                goto fn_exit;
+            }
+            else {
+                /* turn off EXCL for real open */
+                access_mode = access_mode ^ ADIO_EXCL;
+            }
+        }
+    } else {
+
+        /* the actual optimized create on one, open on all */
     if (access_mode & ADIO_CREATE && fd->file_system != ADIO_PVFS2) {
         if(rank == fd->hints->ranklist[0]) {
             /* remove delete_on_close flag if set */
@@ -201,6 +232,7 @@
             access_mode ^= ADIO_EXCL;
         }
     }
+    }
 
     /* if we are doing deferred open, non-aggregators should return now */
     if (fd->hints->deferred_open ) {



--
Jeff Squyres
Cisco Systems



[OMPI devel] Moving fragments in btl sm

2007-11-08 Thread Torje Henriksen

Hi,

I have a question that I shouldn't need to ask, but I'm 
kind of lost in the code.


The btl sm component is using the circular buffers to write and read 
fragments (sending and receiving).


In the write_to_head and read_from_tail I can only see pointers being set, 
no data being moved. So where does the actual data movement/copying take 
place? I'm thinking maybe a callback function exists somewhere :)



Thank you for your help now and earlier.


Best regards,

Torje Henriksen
(tor...@stud.cs.uit.no)



Re: [OMPI devel] collective problems

2007-11-08 Thread Gleb Natapov
On Wed, Nov 07, 2007 at 11:25:43PM -0500, Patrick Geoffray wrote:
> Richard Graham wrote:
> > The real problem, as you and others have pointed out, is the lack of
> > predictable time slices for the progress engine to do its work, when relying
> > on the ULP to make calls into the library...
> 
> The real, real problem is that the BTL should handle progression at 
> their level, especially when the buffering is due to BTL-level flow 
> control. When I write something into a socket, TCP will take care of 
> sending it eventually, for example.
In the case of TCP, the kernel is kind enough to progress the message for you,
but only if there was enough space in the kernel's internal buffers. If there
was no space there, the TCP BTL will also buffer messages in userspace and
will, eventually, have the same problem.

To progress such outstanding messages, an additional thread is needed in
userspace. Is this what MX does?

--
Gleb.


Re: [OMPI devel] collective problems

2007-11-08 Thread Gleb Natapov
On Wed, Nov 07, 2007 at 01:16:04PM -0500, George Bosilca wrote:
>
> On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote:
>
>>> The same callback is called in both cases. In the case that you
>>> described, the callback is called just a little bit deeper into the
>>> recursion, when in the "normal case" it will get called from the
>>> first level of the recursion. Or maybe I miss something here ...
>>
>> Right -- it's not the callback that is the problem.  It's when the
>> recursion is unwound and further up the stack you now have a stale
>> request.
>
> That's exactly the point that I fail to see. If the request is freed in the
> PML callback, then it should get released in both cases, and therefore lead
> to problems all the time. Which, obviously, is not true when we do not have
> this deep recursion thing going on.
>
> Moreover, the request management is based on reference counting. The PML
> level has one ref count and the MPI level has another one. In fact, we
> cannot release a request until we explicitly call ompi_request_free on it.
> The place where this call happens differs between the blocking and
> non-blocking calls. In the non-blocking case, ompi_request_free gets called
> from the *_test (*_wait) functions, while in the blocking case it gets called
> directly from the MPI_Send function.
>
> Let me summarize: a request cannot reach a stale state without a call to
> ompi_request_free. This function is never called directly from the PML
> level. Therefore, the recursion depth should not have any impact on the
> state of the request!

I looked at the code one more time and it seems to me now that George is
absolutely right. The scenario I described cannot happen because we call
ompi_request_free() at the top of the stack. I somehow had the
impression that we mark internal requests as freed before calling
send(). So I'll go and implement the NOT_ON_WIRE extension when I have
time for it.

--
Gleb.


Re: [OMPI devel] collective problems

2007-11-08 Thread Gleb Natapov
On Wed, Nov 07, 2007 at 09:07:23PM -0700, Brian Barrett wrote:
> Personally, I'd rather just not mark MPI completion until a local  
> completion callback from the BTL.  But others don't like that idea, so  
> we came up with a way for back pressure from the BTL to say "it's not  
> on the wire yet".  This is more complicated than just not marking MPI  
> completion early, but why would we do something that helps real apps  
> at the expense of benchmarks?  That would just be silly!
> 
I fully agree with Brian here. Trying to solve the issue with the current
approach will introduce additional checking in the fast path and will
only hurt real apps.
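
For reference, the interface change being debated amounts to roughly the following (a conceptual, self-contained sketch, not the actual OMPI BTL/PML code): a third return status from the BTL send path, with MPI completion deferred until the BTL's local completion callback fires. The extra switch on the return value is the kind of fast-path check referred to above.

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { BTL_DONE, BTL_NOT_ON_WIRE, BTL_ERROR } btl_status_t;

    typedef struct {
        bool mpi_complete;                 /* what MPI_Test/MPI_Wait look at */
    } request_t;

    /* Pretend BTL: it buffered the message (e.g. out of credits) instead of
     * putting it on the wire right away. */
    static btl_status_t toy_btl_send(request_t *req)
    {
        (void)req;
        return BTL_NOT_ON_WIRE;
    }

    /* PML send path: mark MPI completion early only when the data really
     * left; otherwise wait for the BTL's local completion callback.  The
     * extra branch is the additional fast-path check. */
    static void toy_pml_send(request_t *req)
    {
        switch (toy_btl_send(req)) {
        case BTL_DONE:
            req->mpi_complete = true;
            break;
        case BTL_NOT_ON_WIRE:
            req->mpi_complete = false;     /* completed later by the callback */
            break;
        case BTL_ERROR:
            break;                         /* error path elided */
        }
    }

    /* The BTL calls this once the buffered data actually hits the wire. */
    static void toy_btl_completion_cb(request_t *req)
    {
        req->mpi_complete = true;
    }

    int main(void)
    {
        request_t req = { false };
        toy_pml_send(&req);
        printf("complete after send:     %d\n", req.mpi_complete);  /* 0 */
        toy_btl_completion_cb(&req);
        printf("complete after callback: %d\n", req.mpi_complete);  /* 1 */
        return 0;
    }
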

> Brian
> 
> On Nov 7, 2007, at 7:56 PM, Richard Graham wrote:
> 
> > Does this mean that we don’t have a queue to store btl-level descriptors
> > that are only partially complete?  Do we do an all-or-nothing with respect
> > to btl-level requests at this stage?
> >
> > Seems to me like we want to mark things complete at the MPI level ASAP, and
> > that this proposal is not to do that – is this correct?
> >
> > Rich
> >
> >
> > On 11/7/07 11:26 PM, "Jeff Squyres"  wrote:
> >
> >> On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote:
> >>
> >> >> Remember that this is all in the context of Galen's proposal for
> >> >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the send
> >> >> was successful, but it has not yet been sent (e.g., openib BTL
> >> >> buffered it because it ran out of credits).
> >> >
> >> > Sorry if I miss something obvious, but why does the PML have to be
> >> > aware of the flow-control situation of the BTL?  If the BTL cannot send
> >> > something right away for any reason, it should be the responsibility
> >> > of the BTL to buffer it and to progress on it later.
> >>
> >>
> >> That's currently the way it is.  But the BTL currently only has the
> >> option to say two things:
> >>
> >> 1. "ok, done!" -- then the PML will think that the request is  
> >> complete
> >> 2. "doh -- error!" -- then the PML thinks that Something Bad
> >> Happened(tm)
> >>
> >> What we really need is for the BTL to have a third option:
> >>
> >> 3. "not done yet!"
> >>
> >> So that the PML knows that the request is not yet done, but will  
> >> allow
> >> other things to progress while we're waiting for it to complete.
> >> Without this, the openib BTL currently replies "ok, done!", even when
> >> it has only buffered a message (rather than actually sending it out).
> >> This optimization works great (yeah, I know...) except for apps that
> >> don't dip into the MPI library frequently.  :-\
> >>
> >> --
> >> Jeff Squyres
> >> Cisco Systems
> >>

--
Gleb.