Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16691
On Thu, Nov 08, 2007 at 08:02:09AM -0500, Jeff Squyres wrote:
> >> All ROMIO patches *must* be coordinated with the ROMIO maintainers.
> > Upstream? That's the upstream patch.
> That was extracted from ROMIO itself? Which release?

From Jiri:

The patch was extracted from the ROMIO sources that come with MPICH2 1.0.6. As noted on the ROMIO web page (http://www-unix.mcs.anl.gov/romio/): "Note: The version of ROMIO described on this page is an old one. We haven't released newer versions of ROMIO as independent packages for a while; they were included as part of MPICH2 and MPICH-1. You can get the latest version of ROMIO when you download MPICH2 or MPICH-1."

--- end of Jiri ---

> > Jiri Polach has extracted the fix for this problem. Updating OMPI to a
> > newer ROMIO version should do the trick, so we might want to revert
> > r16693 and r16691.
> It would be great to upgrade to a newer version of ROMIO. Do you have
> the cycles to do it?

Let's see ;) If life is going to be boring, I'll have a look at it ;)

> If this is slated for v1.3, then I think it would be much better to
> back out that patch and then do a real upgrade.

ACK.

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
private: http://adi.thur.de
Re: [OMPI devel] collective problems
Hi Gleb,

Gleb Natapov wrote:
> In the case of TCP, kernel is kind enough to progress message for you,
> but only if there was enough space in a kernel internal buffers. If there
> was no place there, TCP BTL will also buffer messages in userspace and
> will, eventually, have the same problem.

Occasionally buffering to hide a flow-control issue is fine, assuming that there is a mechanism to flush the buffer (below). However, you cannot buffer everything, and it is just as fine to expose the back pressure when the buffer space is exhausted, to show the application that there is a sustained problem. In this case, it is reasonable to block the application (i.e. the MPI request) while you cannot buffer the outgoing data. The real problem is the progression of already buffered outgoing data, not the buffering itself. Here, the proposal is to allow the BTL to buffer, but requires the PML to handle progression. That's broken, IMHO.

> To progress such outstanding messages additional thread is needed in
> userspace. Is this what MX does?

MX uses a user-level thread, but it's mainly for progressing the higher-level protocol on the receive side. On the send side, for the low-level protocol, it is easier to ask your driver either to wake you up when the sending resource is available again (blocking on a CQ for IB) or to take care of the sending itself.

My overall problem with this proposal is a race to the bottom, based on the lowest BTL, functionality-wise. The PML already imposes a pipelining for large messages (with a few knobs, but still) when most protocols in other BTLs already have their own. Now it's flow-control progression (not MPI progression). Can each BTL implement what is needed for its particular back-end instead of bloating the upper layer?

Patrick
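As a concrete illustration of the buffering-plus-flush behavior Patrick argues for, here is a minimal sketch of a BTL-internal pending queue drained from the BTL's own progress path. This is not Open MPI's BTL interface; btl_pending_t, btl_try_put_on_wire() and btl_flush_pending() are hypothetical names.

/* Hedged sketch: a BTL-internal pending queue, flushed from the BTL's own
 * progress path, so the PML never has to know about transport flow control.
 * All names here are hypothetical, not Open MPI's real API. */

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct btl_pending {
    struct btl_pending *next;
    const void *data;
    size_t len;
} btl_pending_t;

static btl_pending_t *pending_head = NULL;

/* assumed to exist: returns false when the NIC/kernel has no room right now */
extern bool btl_try_put_on_wire(const void *data, size_t len);

/* Called from the BTL's own progress function: drain buffered fragments in
 * order, stopping at the first one that still cannot be sent. */
static void btl_flush_pending(void)
{
    while (pending_head != NULL &&
           btl_try_put_on_wire(pending_head->data, pending_head->len)) {
        btl_pending_t *done = pending_head;
        pending_head = done->next;
        free(done);
    }
}

The point is only that both the queue and the flush live inside the BTL, not in the upper layer.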
Re: [OMPI devel] Multiworld MCA parameter values broken
Might I suggest: https://svn.open-mpi.org/trac/ompi/ticket/1073 It deals with some of these issues and explains the boundaries of the problem. As for what a string param can contain, I have no opinion. I only note that it must handle special characters such as ';', '/', etc. that are typically found in uri's. I cannot think of any reason it should have a quote in it. Ralph On 11/8/07 12:25 PM, "Tim Prins" wrote: > The alias option you presented does not work. I think we do some weird > things to find the absolute path for ssh, instead of just issuing the > command. > > I would spend some time fixing this, but I don't want to do it wrong. We > could quote all the param values, and change the parser to remove the > quotes, but this is assuming that a mca param does not contain quotes. > > So I guess there are 2 questions that need to be answered before a fix > is made: > > 1. What exactly can a string mca param contain? Can it have quotes or > spaces or? > > 2. Which mca parameters should be forwarded? Should it be just the ones > from the command line? From the environment? From config files? > > Tim > > Ralph Castain wrote: >> What changed is that we never passed mca params to the orted before - they >> always went to the app, but it's the orted that has the issue. There is a >> bug ticket thread on this subject - I forget the number immediately. >> >> Basically, the problem was that we cannot generally pass the local >> environment to the orteds when we launch them. However, people needed >> various mca params to get to the orteds to control their behavior. The only >> way to resolve that problem was to pass the params via the command line, >> which is what was done. >> >> Except for a very few cases, all of our mca params are single values that do >> not include spaces, so this is not a problem that is causing widespread >> issues. As I said, I already had to deal with one special case that didn't >> involve spaces, but did have special characters that required quoting, which >> identified the larger problem of dealing with quoted strings. >> >> I have no objection to a more general fix. Like I said in my note, though, >> the general fix will take a larger effort. If someone is willing to do so, >> that is fine with me - I was only offering solutions that would fill the >> interim time as I haven't heard anyone step up to say they would fix it >> anytime soon. >> >> Please feel free to jump in and volunteer! ;-) I'm willing to put the quotes >> around things if you will fix the mca cmd line parser to cleanly remove them >> on the other end. >> >> Ralph >> >> >> >> On 11/7/07 5:50 PM, "Tim Prins" wrote: >> >>> I'm curious what changed to make this a problem. How were we passing mca >>> param >>> from the base to the app before, and why did it change? >>> >>> I think that options 1 & 2 below are no good, since we, in general, allow >>> string mca params to have spaces (as far as I understand it). So a more >>> general approach is needed. >>> >>> Tim >>> >>> On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote: Sorry for delay - wasn't ignoring the issue. There are several fixes to this problem - ranging in order from least to most work: 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. It won't affect anything on the backend because the daemon/procs don't use ssh. 2. include "pls_rsh_agent" in the array of mca params not to be passed to the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the orte_pls_base_orted_append_basic_args function. 
This would fix the specific problem cited here, but I admit that listing every such param by name would get tedious. 3. we could easily detect that a "problem" character was in the mca param value when we add it to the orted's argv, and then put "" around it. The problem, however, is that the mca param parser on the far end doesn't remove those "" from the resulting string. At least, I spent over a day fighting with a problem only to discover that was happening. Could be an error in the way I was doing things, or could be a real characteristic of the parser. Anyway, we would have to ensure that the parser removes any surrounding "" before passing along the param value or this won't work. Ralph On 11/5/07 12:10 PM, "Tim Prins" wrote: > Hi, > > Commit 16364 broke things when using multiword mca param values. For > instance: > > mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent > "ssh -Y" xterm > > Will crash and burn, because the value "ssh -Y" is being stored into the > argv orted_cmd_line in orterun.c:1506. This is then added to the launch > command for the orted: > > /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ; > export PATH ; > LD_LIBRARY_PATH=/san
Re: [OMPI devel] Multiworld MCA parameter values broken
The alias option you presented does not work. I think we do some weird things to find the absolute path for ssh, instead of just issuing the command. I would spend some time fixing this, but I don't want to do it wrong. We could quote all the param values, and change the parser to remove the quotes, but this is assuming that a mca param does not contain quotes. So I guess there are 2 questions that need to be answered before a fix is made: 1. What exactly can a string mca param contain? Can it have quotes or spaces or? 2. Which mca parameters should be forwarded? Should it be just the ones from the command line? From the environment? From config files? Tim Ralph Castain wrote: What changed is that we never passed mca params to the orted before - they always went to the app, but it's the orted that has the issue. There is a bug ticket thread on this subject - I forget the number immediately. Basically, the problem was that we cannot generally pass the local environment to the orteds when we launch them. However, people needed various mca params to get to the orteds to control their behavior. The only way to resolve that problem was to pass the params via the command line, which is what was done. Except for a very few cases, all of our mca params are single values that do not include spaces, so this is not a problem that is causing widespread issues. As I said, I already had to deal with one special case that didn't involve spaces, but did have special characters that required quoting, which identified the larger problem of dealing with quoted strings. I have no objection to a more general fix. Like I said in my note, though, the general fix will take a larger effort. If someone is willing to do so, that is fine with me - I was only offering solutions that would fill the interim time as I haven't heard anyone step up to say they would fix it anytime soon. Please feel free to jump in and volunteer! ;-) I'm willing to put the quotes around things if you will fix the mca cmd line parser to cleanly remove them on the other end. Ralph On 11/7/07 5:50 PM, "Tim Prins" wrote: I'm curious what changed to make this a problem. How were we passing mca param from the base to the app before, and why did it change? I think that options 1 & 2 below are no good, since we, in general, allow string mca params to have spaces (as far as I understand it). So a more general approach is needed. Tim On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote: Sorry for delay - wasn't ignoring the issue. There are several fixes to this problem - ranging in order from least to most work: 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. It won't affect anything on the backend because the daemon/procs don't use ssh. 2. include "pls_rsh_agent" in the array of mca params not to be passed to the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the orte_pls_base_orted_append_basic_args function. This would fix the specific problem cited here, but I admit that listing every such param by name would get tedious. 3. we could easily detect that a "problem" character was in the mca param value when we add it to the orted's argv, and then put "" around it. The problem, however, is that the mca param parser on the far end doesn't remove those "" from the resulting string. At least, I spent over a day fighting with a problem only to discover that was happening. Could be an error in the way I was doing things, or could be a real characteristic of the parser. 
Anyway, we would have to ensure that the parser removes any surrounding "" before passing along the param value or this won't work. Ralph On 11/5/07 12:10 PM, "Tim Prins" wrote: Hi, Commit 16364 broke things when using multiword mca param values. For instance: mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent "ssh -Y" xterm Will crash and burn, because the value "ssh -Y" is being stored into the argv orted_cmd_line in orterun.c:1506. This is then added to the launch command for the orted: /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872 --nsreplica "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 :4090 8" --gprreplica "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 :4090 8" -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca mca_base_param_file_path /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/ examp les -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So the quotes have been lost, as we die a horrible death. So we nee
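To make Ralph's option 3 concrete, here is a minimal sketch of the quote/unquote pair that would be needed: the launcher wraps any multiword MCA value before appending it to the orted command line, and the parser on the far end strips exactly one surrounding pair. quote_if_needed() and unquote() are hypothetical helpers, not the real orte code, and the sketch deliberately sidesteps Tim's open question about values that themselves contain quotes.

/* Hedged sketch of option 3: quote multiword MCA values when building the
 * orted argv, and strip one matching pair on the receiving side.  These are
 * hypothetical helpers, not the actual orte/mca parser. */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *quote_if_needed(const char *value)
{
    if (strpbrk(value, " \t;") == NULL) {
        return strdup(value);                 /* plain single-word value */
    }
    char *quoted = malloc(strlen(value) + 3);
    sprintf(quoted, "\"%s\"", value);         /* e.g. ssh -Y  ->  "ssh -Y" */
    return quoted;
}

/* Receiving side (the mca cmd line parser): remove one surrounding pair. */
static void unquote(char *value)
{
    size_t len = strlen(value);
    if (len >= 2 && value[0] == '"' && value[len - 1] == '"') {
        memmove(value, value + 1, len - 2);
        value[len - 2] = '\0';
    }
}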
Re: [OMPI devel] Moving fragments in btl sm
The real memory copy happens in the convertor, more specifically in ompi_convertor_pack for the sender and in ompi_convertor_unpack for the receiver. In fact, none of the BTLs calls memcpy directly; all memory movement is done via the convertor.

  george.

On Nov 8, 2007, at 7:38 AM, Torje Henriksen wrote:

> Hi, I have a question that I shouldn't need to ask, but I'm kind of lost in the code. The btl sm component is using the circular buffers to write and read fragments (sending and receiving). In the write_to_head and read_from_tail I can only see pointers being set, no data being moved. So where does the actual data movement/copying take place? I'm thinking maybe a callback function existing somewhere :)
>
> Thank you for your help now and earlier.
>
> Best regards,
>
> Torje Henriksen (tor...@stud.cs.uit.no)
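For readers hunting for the copy, here is a minimal sketch of the pack path George describes: the BTL hands the convertor an iovec pointing into its fragment, and the convertor does the actual copy from the user buffer. The ompi_convertor_pack() signature and the header path are approximate (they have changed across versions), and pack_into_fragment() is a made-up wrapper, not the real sm BTL code.

/* Hedged sketch of the send-side pack path.  The convertor copies (and, if
 * needed, converts) the user data into the BTL's fragment; the BTL itself
 * never calls memcpy.  Signatures are approximate. */

#include <sys/uio.h>
#include <stdint.h>
#include "ompi/datatype/convertor.h"   /* OMPI-internal header; path approximate */

static size_t pack_into_fragment(ompi_convertor_t *convertor,
                                 void *frag_payload, size_t frag_size)
{
    struct iovec iov;
    uint32_t iov_count = 1;
    size_t max_data = frag_size;

    iov.iov_base = frag_payload;   /* where the copy lands                */
    iov.iov_len  = frag_size;      /* how much the fragment can hold      */

    /* The convertor performs the actual data movement. */
    ompi_convertor_pack(convertor, &iov, &iov_count, &max_data);

    return max_data;               /* bytes actually packed               */
}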
Re: [OMPI devel] Moving fragments in btl sm
On Thu, 2007-11-08 at 13:38 +0100, Torje Henriksen wrote:
> Hi,
>
> I have a question that I shouldn't need to ask, but I'm
> kind of lost in the code.
>
> The btl sm component is using the circular buffers to write and read
> fragments (sending and receiving).
>
> In the write_to_head and read_from_tail I can only see pointers being set,
> no data being moved. So where does the actual data movement/copying take
> place? I'm thinking maybe a callback function existing somewhere :)
>
> Thank you for your help now and earlier.

You are right. The "real thing" happens in mca_btl_sm_component_progress(). The PML/BML calls btl_register() to register a callback function to be called when a frag is received. In the event loop, the progress() function is called periodically to check whether any new frag has arrived.

It is complicated a little bit by the fact that to transmit each "data" frag, there is a round trip and two "frags" are exchanged. The send side sends the "data" frag with header type SEND to the receiver. The receiver calls the callback function to handle the frag and sends back an ACK frag. Upon receiving the ACK frag, the send side calls des_cbfunc() to tell the upper layer that the sending of this frag is complete.

BTW, it looks like it is still list append/remove in the PML/BML layer. I don't know when/where the real "copying" happens.

Ollie
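A minimal sketch of the SEND/ACK round trip Ollie describes, with the receive-side progress loop dispatching to the registered callback. All the type and function names here (sm_frag_t, fifo_read_from_tail, recv_cb, ...) are simplified stand-ins, not the actual sm BTL structures.

/* Hedged sketch of the round trip: poll the shared-memory FIFO, hand SEND
 * frags to the PML callback and return an ACK; on ACK, complete the sender's
 * descriptor.  Names are illustrative only. */

#include <stddef.h>

enum { FRAG_SEND, FRAG_ACK };

typedef struct sm_frag {
    int  type;                            /* FRAG_SEND or FRAG_ACK           */
    void (*des_cbfunc)(struct sm_frag*);  /* completion callback (send side) */
    void  *payload;
    size_t len;
} sm_frag_t;

extern sm_frag_t *fifo_read_from_tail(void);            /* assumed: poll my FIFO   */
extern void       fifo_write_ack(sm_frag_t *frag);      /* assumed: return an ACK  */
extern void     (*recv_cb)(void *payload, size_t len);  /* set via btl_register()  */

static int sm_progress(void)
{
    int count = 0;
    sm_frag_t *frag;

    while ((frag = fifo_read_from_tail()) != NULL) {
        if (frag->type == FRAG_SEND) {
            recv_cb(frag->payload, frag->len);  /* hand data to the PML          */
            fifo_write_ack(frag);               /* round trip back to the sender */
        } else {                                /* FRAG_ACK                      */
            frag->des_cbfunc(frag);             /* sender: frag is now complete  */
        }
        count++;
    }
    return count;   /* number of frags progressed */
}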
Re: [OMPI devel] collective problems
Decreasing the latency is the main reason. If we delay the MPI completion, then we always have to call opal_progress at least once in order to allow the BTL to trigger the callback. In the current implementation, we never call opal_progress on small messages, unless there is some kind of resource starvation.

  Thanks,
    george.

On Nov 8, 2007, at 11:09 AM, Andrew Friedley wrote:

> Brian Barrett wrote:
>> Personally, I'd rather just not mark MPI completion until a local completion callback from the BTL. But others don't like that idea, so we came up with a way for back pressure from the BTL to say "it's not on the wire yet". This is more complicated than just not marking MPI completion early, but why would we do something that helps real apps at the expense of benchmarks? That would just be silly!
>
> FWIW this issue is also very relevant for the UD BTL, especially with some new work I've done in the last week (currently having problems with send-side completion semantics). I missed it, what was the reasoning for not marking MPI completion until a callback from the BTL?
>
> Andrew
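A minimal sketch of the trade-off George is pointing at, under the assumption that completion is signalled by a BTL callback. btl_send_frag() and progress_once() are hypothetical stand-ins (the latter for a call into opal_progress), not the real PML code.

/* Hedged sketch: eager completion vs. completion deferred to the BTL's
 * local-completion callback.  Names are illustrative only. */

extern int  btl_send_frag(void *frag);   /* hypothetical: hands the fragment to the BTL */
extern void progress_once(void);         /* stands in for a call to opal_progress()     */

static volatile int req_complete;        /* set from the BTL's completion callback       */

static void send_eager_completion(void *frag)
{
    btl_send_frag(frag);
    req_complete = 1;        /* mark MPI completion immediately: lowest latency,
                              * but the data may only be buffered, not on the wire */
}

static void send_deferred_completion(void *frag)
{
    req_complete = 0;
    btl_send_frag(frag);
    while (!req_complete) {  /* cannot return until the BTL reports local completion,  */
        progress_once();     /* so even a tiny send now pays at least one progress call */
    }
}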
Re: [OMPI devel] collective problems
Brian Barrett wrote: > Personally, I'd rather just not mark MPI completion until a local completion callback from the BTL. But others don't like that idea, so we came up with a way for back pressure from the BTL to say "it's not on the wire yet". This is more complicated than just not marking MPI completion early, but why would we do something that helps real apps at the expense of benchmarks? That would just be silly! FWIW this issue is also very relevant for the UD BTL, especially with some new work I've done in the last week (currently having problems with send-side completion semantics). I missed it, what was the reasoning for not marking MPI completion until a callback from the BTL? Andrew
Re: [OMPI devel] Release wiki pages
Thanks Jeff! On Nov 8, 2007 9:07 AM, Jeff Squyres wrote: > I literally just discovered that the trac "milestone" pages can be > edited. > > This seems like a much better place to put the 1.1, 1.2, and 1.3 > release series wiki pages. So I moved all the content and updated the > links on the front wiki page. Each "1.x" milestone is now a top-level > view of the entire series with links to individual milestone pages for > details about that specific release. For example: > > https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.2 > https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3 > > Release Managers / Gatekeepers: > - the pages are editable just like other wiki pages, so nothing > changes there > - when you edit the milestone page, I note that there's a handy > "Retarget associated open tickets to milestone [X]" option for moving > all leftover tickets to the next milestone > > -- > Jeff Squyres > Cisco Systems > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/ tmat...@gmail.com || timat...@open-mpi.org I'm a bright... http://www.the-brights.net/
Re: [OMPI devel] collective problems
On 11/8/07 4:03 AM, "Gleb Natapov" wrote:

> On Wed, Nov 07, 2007 at 11:25:43PM -0500, Patrick Geoffray wrote:
>> Richard Graham wrote:
>>> The real problem, as you and others have pointed out, is the lack of
>>> predictable time slices for the progress engine to do its work, when relying
>>> on the ULP to make calls into the library...
>>
>> The real, real problem is that the BTL should handle progression at
>> their level, specially when the buffering is due to BTL-level flow
>> control. When I write something into a socket, TCP will take care of
>> sending it eventually, for example.
>
> In the case of TCP, kernel is kind enough to progress message for you,
> but only if there was enough space in a kernel internal buffers. If there
> was no place there, TCP BTL will also buffer messages in userspace and
> will, eventually, have the same problem.
>
> To progress such outstanding messages additional thread is needed in
> userspace. Is this what MX does?

Yes - this is the bottom line, with the current problem being the high cost of scheduling such threads at some sort of reasonable frequency.

Rich

> --
> Gleb.
[OMPI devel] Release wiki pages
I literally just discovered that the trac "milestone" pages can be edited. This seems like a much better place to put the 1.1, 1.2, and 1.3 release series wiki pages. So I moved all the content and updated the links on the front wiki page. Each "1.x" milestone is now a top-level view of the entire series with links to individual milestone pages for details about that specific release. For example: https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.2 https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3 Release Managers / Gatekeepers: - the pages are editable just like other wiki pages, so nothing changes there - when you edit the milestone page, I note that there's a handy "Retarget associated open tickets to milestone [X]" option for moving all leftover tickets to the next milestone -- Jeff Squyres Cisco Systems
Re: [OMPI devel] accessors to context id and message id's
George Bosilca wrote:

On Nov 6, 2007, at 8:38 AM, Terry Dontje wrote:

George Bosilca wrote: If I understand correctly your question, then we don't need any extension. Each request has a unique ID (from PERUSE perspective). However, if I remember well this is only half implemented in our PERUSE layer (i.e. it works only for expected requests).

Looking at the peruse macros it looks like the unique ID is the base_req address, which I imagine rarely matches between processes.

That's a completely different topic. If what you need is a unique ID for each request between processes, in other words, a unique ID for each message, then here is the way to go. Use the same information as the MPI matching logic, i.e. (comm_id, remote, tag), to create an identifier for each message. It will not be unique as multiple messages can generate the same ID, but you can generate a unique ID per message with easy tricks.

I understand that one could try and rely on the order of the messages being sent and received, however this only works if you ultimately capture every message, which is something I would like to avoid. My hope was to use something already embedded into the library, not having to add more crap on top of the library. This seems like something that would be useful to any tracing utility (like Vampir). However, I imagine the argument against such a thing is that not all MPI libraries would support such an ID, thus making this a one-off.

The PERUSE standard requires that the ID is unique for each process, and for the lifetime of the request. It does not require that the ID be unique across processes. And this is why we're using the base_req as an ID.

I understand that the PERUSE spec did not define the ID to be unique across processes, which is why I was surprised by your answer. Score one for miscommunications. It would have been nice if the PERUSE committee had provided an option for an implementation to expose message ids.

--td

george.

This should be quite easy to fix, if someone invests a few hours into it. For the context id, a user can always use the c2f function to get the fortran ID (which for Open MPI is the communicator ID).

Cool, I didn't realize that.

thanks,

--td

Thanks,
  george.

On Nov 5, 2007, at 8:01 AM, Terry Dontje wrote:

Currently, in order to do message tracing, one either has to rely on some error-prone postprocessing of data or replicate some MPI internals up in the PMPI layer. It would help Sun's tools group (and I believe U Dresden also) if Open MPI would create a couple of APIs that exposed the following:

1. PML Message ids used for a request
2. Context id for a specific communicator

I could see a couple of ways of providing this information: either by extending the PERUSE probes or by creating actual functions that one would pass a request handle or communicator handle to get the appropriate data back. This is just a thought right now, which is why this email is not in an RFC format. I wanted to get a feel from the community as to the interest in such APIs and whether anyone may have specific issues with us providing such interfaces. If the responses seem positive I will follow this message up with an RFC.

thanks,

--td
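A minimal sketch of the (comm_id, remote, tag) trick George suggests, using MPI_Comm_c2f for the context id plus a local sequence number; the 64-bit packing and the small lookup table are illustrative choices only, not an Open MPI API. Note that the sequence part only agrees on both sides if every message on that triple is observed, which is exactly the limitation Terry wants to avoid.

/* Hedged sketch: derive a per-message identifier from the MPI matching tuple
 * (communicator id, peer, tag) plus a local sequence number.  Layout and
 * lookup are illustrative only. */

#include <mpi.h>
#include <stdint.h>

#define MAX_TRACKED 1024

static struct { int ctx, peer, tag; uint16_t seq; } table[MAX_TRACKED];
static int table_used = 0;

uint64_t message_id(MPI_Comm comm, int peer, int tag)
{
    int ctx = (int)MPI_Comm_c2f(comm);   /* for Open MPI this is the context id */
    int i;

    for (i = 0; i < table_used; i++) {
        if (table[i].ctx == ctx && table[i].peer == peer && table[i].tag == tag)
            break;
    }
    if (i == table_used && table_used < MAX_TRACKED) {
        table[i].ctx = ctx; table[i].peer = peer; table[i].tag = tag;
        table[i].seq = 0;
        table_used++;
    }

    /* 16 bits context | 16 bits peer | 16 bits tag | 16 bits sequence */
    return ((uint64_t)(ctx  & 0xffff) << 48) |
           ((uint64_t)(peer & 0xffff) << 32) |
           ((uint64_t)(tag  & 0xffff) << 16) |
           (uint64_t)table[i].seq++;
}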
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16691
On Nov 8, 2007, at 7:57 AM, Adrian Knoth wrote:

>> All ROMIO patches *must* be coordinated with the ROMIO maintainers.
>
> Upstream? That's the upstream patch.

That was extracted from ROMIO itself? Which release?

> Jiri Polach has extracted the fix for this problem. Updating OMPI to a
> newer ROMIO version should do the trick, so we might want to revert
> r16693 and r16691.

It would be great to upgrade to a newer version of ROMIO. Do you have the cycles to do it?

If this is slated for v1.3, then I think it would be much better to back out that patch and then do a real upgrade. I have a few ideas about making the integration easier (e.g., forget the whole idea of renaming files -- it was a good idea but has a) turned out to not really be necessary in practice [even though theoretically it's still the Right Thing to Do], and b) it's a giant PITA for continual integration), and Rob Latham has indicated that he was going to put ROMIO in its own SVN which might make 1 or 2 of the integration issues easier (but we're certainly not going to grab random snapshots :-) ). There was a short mail thread about this a while ago on this list.

I'd be happy to point someone in the right direction for ROMIO maintenance, but I do not have the cycles to do this at the moment. Probably not until the January/February timeframe, unfortunately...

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16691
On Thu, Nov 08, 2007 at 07:51:28AM -0500, Jeff Squyres wrote: [r16691] > Whoa; I'm not sure we want to apply this. Me neither. > All ROMIO patches *must* be coordinated with the ROMIO maintainers. Upstream? That's the upstream patch. Jiri Polach has extracted the fix for this problem. Updating OMPI to a newer ROMIO version should do the trick, so we might want to revert r16693 and r16691. You decide. -- Cluster and Metacomputing Working Group Friedrich-Schiller-Universität Jena, Germany private: http://adi.thur.de
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16691
Whoa; I'm not sure we want to apply this. All ROMIO patches *must* be coordinated with the ROMIO maintainers. Otherwise this becomes a complete nightmare of logistics. There's already a few other ROMIO patches that we have consciously chosen not to apply because of the tangled issues that arise because of it, such as:

- "what version of ROMIO is in OMPI?"
- "do you have patch X?"
- ...etc.

Hence, it is best to coordinate all ROMIO patches with the upstream ROMIO maintainers.

On Nov 8, 2007, at 7:44 AM, a...@osl.iu.edu wrote:

> Author: adi
> Date: 2007-11-08 07:44:10 EST (Thu, 08 Nov 2007)
> New Revision: 16691
> URL: https://svn.open-mpi.org/trac/ompi/changeset/16691
>
> Log:
> upstream patch, provided by Jiri Polach. Re #733
>
> Text files modified:
>    trunk/ompi/mca/io/romio/romio/adio/common/ad_open.c | 32 ++++++++++
>    1 files changed, 32 insertions(+), 0 deletions(-)
>
> Modified: trunk/ompi/mca/io/romio/romio/adio/common/ad_open.c
> ==============================================================================
> --- trunk/ompi/mca/io/romio/romio/adio/common/ad_open.c (original)
> +++ trunk/ompi/mca/io/romio/romio/adio/common/ad_open.c 2007-11-08 07:44:10 EST (Thu, 08 Nov 2007)
> @@ -172,6 +172,37 @@
>       */
>      /* pvfs2 handles opens specially, so it is actually more efficent for that
>       * file system if we skip this optimization */
> +    /* NFS handles opens especially poorly, so we cannot use this optimization
> +     * on that FS */
> +    if (fd->file_system == ADIO_NFS) {
> +        /* no optimizations for NFS: */
> +        if ((access_mode & ADIO_CREATE) && (access_mode & ADIO_EXCL)) {
> +            /* the open should fail if the file exists. Only *1* process should
> +               check this. Otherwise, if all processes try to check and the file
> +               does not exist, one process will create the file and others who
> +               reach later will return error. */
> +            if (rank == fd->hints->ranklist[0]) {
> +                fd->access_mode = access_mode;
> +                (*(fd->fns->ADIOI_xxx_Open))(fd, error_code);
> +                MPI_Bcast(error_code, 1, MPI_INT, \
> +                          fd->hints->ranklist[0], fd->comm);
> +                /* if no error, close the file and reopen normally below */
> +                if (*error_code == MPI_SUCCESS)
> +                    (*(fd->fns->ADIOI_xxx_Close))(fd, error_code);
> +            }
> +            else MPI_Bcast(error_code, 1, MPI_INT,
> +                           fd->hints->ranklist[0], fd->comm);
> +            if (*error_code != MPI_SUCCESS) {
> +                goto fn_exit;
> +            }
> +            else {
> +                /* turn off EXCL for real open */
> +                access_mode = access_mode ^ ADIO_EXCL;
> +            }
> +        }
> +    } else {
> +
> +    /* the actual optimized create on one, open on all */
>      if (access_mode & ADIO_CREATE && fd->file_system != ADIO_PVFS2) {
>          if (rank == fd->hints->ranklist[0]) {
>              /* remove delete_on_close flag if set */
> @@ -201,6 +232,7 @@
>              access_mode ^= ADIO_EXCL;
>          }
>      }
> +    }
>
>      /* if we are doing deferred open, non-aggregators should return now */
>      if (fd->hints->deferred_open ) {

--
Jeff Squyres
Cisco Systems
[OMPI devel] Moving fragments in btl sm
Hi,

I have a question that I shouldn't need to ask, but I'm kind of lost in the code.

The btl sm component is using the circular buffers to write and read fragments (sending and receiving). In the write_to_head and read_from_tail I can only see pointers being set, no data being moved. So where does the actual data movement/copying take place? I'm thinking maybe a callback function existing somewhere :)

Thank you for your help now and earlier.

Best regards,

Torje Henriksen (tor...@stud.cs.uit.no)
Re: [OMPI devel] collective problems
On Wed, Nov 07, 2007 at 11:25:43PM -0500, Patrick Geoffray wrote: > Richard Graham wrote: > > The real problem, as you and others have pointed out is the lack of > > predictable time slices for the progress engine to do its work, when relying > > on the ULP to make calls into the library... > > The real, real problem is that the BTL should handle progression at > their level, specially when the buffering is due to BTL-level flow > control. When I write something into a socket, TCP will take care of > sending it eventually, for example. In the case of TCP, kernel is kind enough to progress message for you, but only if there was enough space in a kernel internal buffers. If there was no place there, TCP BTL will also buffer messages in userspace and will, eventually, have the same problem. To progress such outstanding messages additional thread is needed in userspace. Is this what MX does? -- Gleb.
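A minimal sketch of the "additional thread in userspace" Gleb mentions: a helper thread that periodically drains whatever the BTL had to buffer, independent of the application calling into MPI. drain_pending_sends() is a hypothetical stand-in for the BTL's flush routine, and the 1 ms period is arbitrary; the cost of scheduling such a thread at a reasonable frequency is the concern Rich raises elsewhere in this thread.

/* Hedged sketch of a userspace progression thread.  Names and the polling
 * period are illustrative only. */

#include <pthread.h>
#include <unistd.h>
#include <stdbool.h>

extern void drain_pending_sends(void);        /* assumed: flush buffered frags */
static volatile bool progress_thread_run = true;
static pthread_t progress_tid;

static void *progress_thread_main(void *arg)
{
    (void)arg;
    while (progress_thread_run) {
        drain_pending_sends();   /* make forward progress without the app's help */
        usleep(1000);            /* arbitrary 1 ms scheduling period              */
    }
    return NULL;
}

/* start/stop, e.g. from BTL init/finalize */
static void progress_thread_start(void)
{
    pthread_create(&progress_tid, NULL, progress_thread_main, NULL);
}

static void progress_thread_stop(void)
{
    progress_thread_run = false;
    pthread_join(progress_tid, NULL);
}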
Re: [OMPI devel] collective problems
On Wed, Nov 07, 2007 at 01:16:04PM -0500, George Bosilca wrote: > > On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote: > >>> The same callback is called in both cases. In the case that you >>> described, the callback is called just a little bit deeper into the >>> recursion, when in the "normal case" it will get called from the >>> first level of the recursion. Or maybe I miss something here ... >> >> Right -- it's not the callback that is the problem. It's when the >> recursion is unwound and further up the stack you now have a stale >> request. > > That's exactly the point that I fail to see. If the request is freed in the > PML callback, then it should get release in both cases, and therefore lead > to problems all the time. Which, obviously, is not true when we do not have > this deep recursion thing going on. > > Moreover, he request management is based on the reference count. The PML > level have one ref count and the MPI level have another one. In fact, we > cannot release a request until we explicitly call ompi_request_free on it. > The place where this call happens is different between the blocking and non > blocking calls. In the non blocking case the ompi_request_free get called > from the *_test (*_wait) functions while in the blocking case it get called > directly from the MPI_Send function. > > Let me summarize: a request cannot reach a stale state without a call to > ompi_request_free. This function is never called directly from the PML > level. Therefore, the recursion depth should not have any impact on the > state of the request ! I looked at the code one more time and it seems to me now that George is absolutely right. The scenario I described cannot happen because we call ompi_request_free() at the top of the stack. I somehow had an impression that we mark internal requests as freed before calling send(). So I'll go and implement NOT_ON_WIRE extension when I'll have time for it. -- Gleb.
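A toy sketch of the reference-counting scheme George describes: the PML and the MPI layer each hold one reference, and the request can only be destroyed after ompi_request_free() drops the MPI-level reference at the top of the stack and the PML drops its own. toy_request_t and the function names are illustrative, not the real ompi_request_t.

/* Hedged sketch: why a deep recursion inside the PML cannot, by itself,
 * leave the upper stack frames holding a freed ("stale") request. */

#include <stdlib.h>

typedef struct {
    int refcount;      /* starts at 2: one for the PML, one for the MPI layer */
    int complete;      /* set when the transfer is done                        */
} toy_request_t;

static void request_release(toy_request_t *req)
{
    if (--req->refcount == 0) {
        free(req);     /* only now can the request become "stale"             */
    }
}

/* PML completion path: mark complete, drop the PML reference. */
static void pml_complete(toy_request_t *req)
{
    req->complete = 1;
    request_release(req);
}

/* MPI_Send / MPI_Wait path: drop the MPI-level reference at the top of the
 * stack, after the PML has done whatever recursion it needed. */
static void mpi_request_free(toy_request_t *req)
{
    request_release(req);
}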
Re: [OMPI devel] collective problems
On Wed, Nov 07, 2007 at 09:07:23PM -0700, Brian Barrett wrote: > Personally, I'd rather just not mark MPI completion until a local > completion callback from the BTL. But others don't like that idea, so > we came up with a way for back pressure from the BTL to say "it's not > on the wire yet". This is more complicated than just not marking MPI > completion early, but why would we do something that helps real apps > at the expense of benchmarks? That would just be silly! > I fully agree with Brian here. Trying to solve the issue with current approach will introduce additional checking in the fast path and will only hurt real apps. > Brian > > On Nov 7, 2007, at 7:56 PM, Richard Graham wrote: > > > Does this mean that we don’t have a queue to store btl level > > descriptors that > > are only partially complete ? Do we do an all or nothing with > > respect to btl > > level requests at this stage ? > > > > Seems to me like we want to mark things complete at the MPI level > > ASAP, and > > that this proposal is not to do that – is this correct ? > > > > Rich > > > > > > On 11/7/07 11:26 PM, "Jeff Squyres" wrote: > > > >> On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: > >> > >> >> Remember that this is all in the context of Galen's proposal for > >> >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the > >> send > >> >> was successful, but it has not yet been sent (e.g., openib BTL > >> >> buffered it because it ran out of credits). > >> > > >> > Sorry if I miss something obvious, but why does the PML has to be > >> > aware > >> > of the flow control situation of the BTL ? If the BTL cannot send > >> > something right away for any reason, it should be the > >> responsibility > >> > of > >> > the BTL to buffer it and to progress on it later. > >> > >> > >> That's currently the way it is. But the BTL currently only has the > >> option to say two things: > >> > >> 1. "ok, done!" -- then the PML will think that the request is > >> complete > >> 2. "doh -- error!" -- then the PML thinks that Something Bad > >> Happened(tm) > >> > >> What we really need is for the BTL to have a third option: > >> > >> 3. "not done yet!" > >> > >> So that the PML knows that the request is not yet done, but will > >> allow > >> other things to progress while we're waiting for it to complete. > >> Without this, the openib BTL currently replies "ok, done!", even when > >> it has only buffered a message (rather than actually sending it out). > >> This optimization works great (yeah, I know...) except for apps that > >> don't dip into the MPI library frequently. :-\ > >> > >> -- > >> Jeff Squyres > >> Cisco Systems > >> > >> ___ > >> devel mailing list > >> de...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > >> > > > > ___ > > devel mailing list > > de...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Gleb.
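A minimal sketch of the "third option" Jeff describes in the quoted mail, and how a PML could act on it. The enum and function names here are illustrative; NOT_ON_WIRE is the only name actually used in this thread.

/* Hedged sketch: let btl_send() distinguish "done", "accepted but not on the
 * wire yet", and "error", so the PML only marks MPI completion in the first
 * case and waits for the BTL's local-completion callback in the second. */

typedef enum {
    BTL_SEND_DONE,         /* option 1: "ok, done!"                        */
    BTL_SEND_NOT_ON_WIRE,  /* option 3: buffered, completion comes later   */
    BTL_SEND_ERROR         /* option 2: "doh -- error!"                    */
} btl_send_status_t;

extern btl_send_status_t btl_send(void *frag);     /* assumed BTL entry point */
extern void mark_mpi_complete(void *request);      /* assumed PML helper      */

static int pml_start_send(void *request, void *frag)
{
    switch (btl_send(frag)) {
    case BTL_SEND_DONE:
        mark_mpi_complete(request);   /* safe: the data is really gone        */
        return 0;
    case BTL_SEND_NOT_ON_WIRE:
        /* leave the request pending; the BTL's local-completion callback
         * will call mark_mpi_complete() once the frag actually leaves. */
        return 0;
    case BTL_SEND_ERROR:
    default:
        return -1;
    }
}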