Re: [OMPI devel] collective problems
On 11/8/07 12:25 AM, "Patrick Geoffray" wrote: > Richard Graham wrote: >> The real problem, as you and others have pointed out is the lack of >> predictable time slices for the progress engine to do its work, when relying >> on the ULP to make calls into the library... > > The real, real problem is that the BTL should handle progression at > their level, specially when the buffering is due to BTL-level flow > control. When I write something into a socket, TCP will take care of > sending it eventually, for example. Agreed - but if it relies on the ULP to get into the progress engine, you still have the problem of a lack of predictable time slices. > > Rich, your clock is one hour late (we change to standard time a couple > of days ago...) Thanks. This is an Entourage problem I have not yet managed to figure out how to fix ... Rich > > Patrick > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] collective problems
The lengths we go to avoid progress :-) On 11/7/07 10:19 PM, "Richard Graham" wrote: > The real problem, as you and others have pointed out is the lack of > predictable time slices for the progress engine to do its work, when relying > on the ULP to make calls into the library... > > Rich > > > On 11/8/07 12:07 AM, "Brian Barrett" wrote: > >> As it stands today, the problem is that we can inject things into the >> BTL successfully that are not injected into the NIC (due to software >> flow control). Once a message is injected into the BTL, the PML marks >> completion on the MPI request. If it was a blocking send that got >> marked as complete, but the message isn't injected into the NIC/NIC >> library, and the user doesn't re-enter the MPI library for a >> considerable amount of time, then we have a problem. >> >> Personally, I'd rather just not mark MPI completion until a local >> completion callback from the BTL. But others don't like that idea, so >> we came up with a way for back pressure from the BTL to say "it's not >> on the wire yet". This is more complicated than just not marking MPI >> completion early, but why would we do something that helps real apps >> at the expense of benchmarks? That would just be silly! >> >> Brian >> >> On Nov 7, 2007, at 7:56 PM, Richard Graham wrote: >> >>> Does this mean that we don¹t have a queue to store btl level >>> descriptors that >>> are only partially complete ? Do we do an all or nothing with >>> respect to btl >>> level requests at this stage ? >>> >>> Seems to me like we want to mark things complete at the MPI level >>> ASAP, and >>> that this proposal is not to do that is this correct ? >>> >>> Rich >>> >>> >>> On 11/7/07 11:26 PM, "Jeff Squyres" wrote: >>> On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: >> Remember that this is all in the context of Galen's proposal for >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the send >> was successful, but it has not yet been sent (e.g., openib BTL >> buffered it because it ran out of credits). > > Sorry if I miss something obvious, but why does the PML has to be > aware > of the flow control situation of the BTL ? If the BTL cannot send > something right away for any reason, it should be the responsibility > of > the BTL to buffer it and to progress on it later. That's currently the way it is. But the BTL currently only has the option to say two things: 1. "ok, done!" -- then the PML will think that the request is complete 2. "doh -- error!" -- then the PML thinks that Something Bad Happened(tm) What we really need is for the BTL to have a third option: 3. "not done yet!" So that the PML knows that the request is not yet done, but will allow other things to progress while we're waiting for it to complete. Without this, the openib BTL currently replies "ok, done!", even when it has only buffered a message (rather than actually sending it out). This optimization works great (yeah, I know...) except for apps that don't dip into the MPI library frequently. :-\ -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] collective problems
Richard Graham wrote: The real problem, as you and others have pointed out is the lack of predictable time slices for the progress engine to do its work, when relying on the ULP to make calls into the library... The real, real problem is that the BTL should handle progression at its own level, especially when the buffering is due to BTL-level flow control. When I write something into a socket, TCP will take care of sending it eventually, for example. Rich, your clock is one hour late (we changed to standard time a couple of days ago...) Patrick
Re: [OMPI devel] collective problems
The real problem, as you and others have pointed out is the lack of predictable time slices for the progress engine to do its work, when relying on the ULP to make calls into the library... Rich On 11/8/07 12:07 AM, "Brian Barrett" wrote: > As it stands today, the problem is that we can inject things into the > BTL successfully that are not injected into the NIC (due to software > flow control). Once a message is injected into the BTL, the PML marks > completion on the MPI request. If it was a blocking send that got > marked as complete, but the message isn't injected into the NIC/NIC > library, and the user doesn't re-enter the MPI library for a > considerable amount of time, then we have a problem. > > Personally, I'd rather just not mark MPI completion until a local > completion callback from the BTL. But others don't like that idea, so > we came up with a way for back pressure from the BTL to say "it's not > on the wire yet". This is more complicated than just not marking MPI > completion early, but why would we do something that helps real apps > at the expense of benchmarks? That would just be silly! > > Brian > > On Nov 7, 2007, at 7:56 PM, Richard Graham wrote: > >> Does this mean that we don¹t have a queue to store btl level >> descriptors that >> are only partially complete ? Do we do an all or nothing with >> respect to btl >> level requests at this stage ? >> >> Seems to me like we want to mark things complete at the MPI level >> ASAP, and >> that this proposal is not to do that is this correct ? >> >> Rich >> >> >> On 11/7/07 11:26 PM, "Jeff Squyres" wrote: >> >>> On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: >>> > Remember that this is all in the context of Galen's proposal for > btl_send() to be able to return NOT_ON_WIRE -- meaning that the >>> send > was successful, but it has not yet been sent (e.g., openib BTL > buffered it because it ran out of credits). Sorry if I miss something obvious, but why does the PML has to be aware of the flow control situation of the BTL ? If the BTL cannot send something right away for any reason, it should be the >>> responsibility of the BTL to buffer it and to progress on it later. >>> >>> >>> That's currently the way it is. But the BTL currently only has the >>> option to say two things: >>> >>> 1. "ok, done!" -- then the PML will think that the request is >>> complete >>> 2. "doh -- error!" -- then the PML thinks that Something Bad >>> Happened(tm) >>> >>> What we really need is for the BTL to have a third option: >>> >>> 3. "not done yet!" >>> >>> So that the PML knows that the request is not yet done, but will >>> allow >>> other things to progress while we're waiting for it to complete. >>> Without this, the openib BTL currently replies "ok, done!", even when >>> it has only buffered a message (rather than actually sending it out). >>> This optimization works great (yeah, I know...) except for apps that >>> don't dip into the MPI library frequently. :-\ >>> >>> -- >>> Jeff Squyres >>> Cisco Systems >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] collective problems
As it stands today, the problem is that we can inject things into the BTL successfully that are not injected into the NIC (due to software flow control). Once a message is injected into the BTL, the PML marks completion on the MPI request. If it was a blocking send that got marked as complete, but the message isn't injected into the NIC/NIC library, and the user doesn't re-enter the MPI library for a considerable amount of time, then we have a problem. Personally, I'd rather just not mark MPI completion until a local completion callback from the BTL. But others don't like that idea, so we came up with a way for back pressure from the BTL to say "it's not on the wire yet". This is more complicated than just not marking MPI completion early, but why would we do something that helps real apps at the expense of benchmarks? That would just be silly! Brian On Nov 7, 2007, at 7:56 PM, Richard Graham wrote: Does this mean that we don’t have a queue to store btl level descriptors that are only partially complete ? Do we do an all or nothing with respect to btl level requests at this stage ? Seems to me like we want to mark things complete at the MPI level ASAP, and that this proposal is not to do that – is this correct ? Rich On 11/7/07 11:26 PM, "Jeff Squyres" wrote: On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: >> Remember that this is all in the context of Galen's proposal for >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the send >> was successful, but it has not yet been sent (e.g., openib BTL >> buffered it because it ran out of credits). > > Sorry if I miss something obvious, but why does the PML has to be > aware > of the flow control situation of the BTL ? If the BTL cannot send > something right away for any reason, it should be the responsibility > of > the BTL to buffer it and to progress on it later. That's currently the way it is. But the BTL currently only has the option to say two things: 1. "ok, done!" -- then the PML will think that the request is complete 2. "doh -- error!" -- then the PML thinks that Something Bad Happened(tm) What we really need is for the BTL to have a third option: 3. "not done yet!" So that the PML knows that the request is not yet done, but will allow other things to progress while we're waiting for it to complete. Without this, the openib BTL currently replies "ok, done!", even when it has only buffered a message (rather than actually sending it out). This optimization works great (yeah, I know...) except for apps that don't dip into the MPI library frequently. :-\ -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] collective problems
Does this mean that we don't have a queue to store btl level descriptors that are only partially complete ? Do we do an all or nothing with respect to btl level requests at this stage ? Seems to me like we want to mark things complete at the MPI level ASAP, and that this proposal is not to do that -- is this correct ? Rich On 11/7/07 11:26 PM, "Jeff Squyres" wrote: > On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: > >>> >> Remember that this is all in the context of Galen's proposal for >>> >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the send >>> >> was successful, but it has not yet been sent (e.g., openib BTL >>> >> buffered it because it ran out of credits). >> > >> > Sorry if I miss something obvious, but why does the PML has to be >> > aware >> > of the flow control situation of the BTL ? If the BTL cannot send >> > something right away for any reason, it should be the responsibility >> > of >> > the BTL to buffer it and to progress on it later. > > > That's currently the way it is. But the BTL currently only has the > option to say two things: > > 1. "ok, done!" -- then the PML will think that the request is complete > 2. "doh -- error!" -- then the PML thinks that Something Bad > Happened(tm) > > What we really need is for the BTL to have a third option: > > 3. "not done yet!" > > So that the PML knows that the request is not yet done, but will allow > other things to progress while we're waiting for it to complete. > Without this, the openib BTL currently replies "ok, done!", even when > it has only buffered a message (rather than actually sending it out). > This optimization works great (yeah, I know...) except for apps that > don't dip into the MPI library frequently. :-\ > > -- > Jeff Squyres > Cisco Systems
Re: [OMPI devel] collective problems
On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ran out of credits). Sorry if I miss something obvious, but why does the PML has to be aware of the flow control situation of the BTL ? If the BTL cannot send something right away for any reason, it should be the responsibility of the BTL to buffer it and to progress on it later. That's currently the way it is. But the BTL currently only has the option to say two things: 1. "ok, done!" -- then the PML will think that the request is complete 2. "doh -- error!" -- then the PML thinks that Something Bad Happened(tm) What we really need is for the BTL to have a third option: 3. "not done yet!" So that the PML knows that the request is not yet done, but will allow other things to progress while we're waiting for it to complete. Without this, the openib BTL currently replies "ok, done!", even when it has only buffered a message (rather than actually sending it out). This optimization works great (yeah, I know...) except for apps that don't dip into the MPI library frequently. :-\ -- Jeff Squyres Cisco Systems
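For illustration, a minimal sketch of what that third option could look like from the PML's side. Every name below (btl_send_status, BTL_SEND_NOT_ON_WIRE, pml_send_fragment, mark_mpi_completion) is a hypothetical placeholder rather than an actual Open MPI symbol; only the control flow matters.

    /* Hypothetical sketch, not the real BTL/PML interface: a tri-state
     * result from btl_send() and how a PML send path could react to it. */

    enum btl_send_status {
        BTL_SEND_DONE,        /* 1. "ok, done!"     -- safe to mark MPI completion   */
        BTL_SEND_ERROR,       /* 2. "doh -- error!" -- Something Bad Happened(tm)    */
        BTL_SEND_NOT_ON_WIRE  /* 3. "not done yet!" -- buffered, e.g. out of credits */
    };

    /* Assumed helpers -- placeholders for whatever the PML really uses. */
    extern enum btl_send_status btl_send(void *btl, void *descriptor);
    extern void mark_mpi_completion(void *request);

    int pml_send_fragment(void *btl, void *descriptor, void *request)
    {
        switch (btl_send(btl, descriptor)) {
        case BTL_SEND_DONE:
            /* The fragment really left; marking MPI completion now is safe. */
            mark_mpi_completion(request);
            return 0;
        case BTL_SEND_NOT_ON_WIRE:
            /* Only buffered: leave the request pending and let the BTL's
             * local-completion callback mark it complete later. */
            return 0;
        default:
            return -1;   /* propagate the error */
        }
    }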
Re: [OMPI devel] collective problems
Jeff Squyres wrote: This is not a problem in the current code base. Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ran out of credits). Sorry if I'm missing something obvious, but why does the PML have to be aware of the flow control situation of the BTL? If the BTL cannot send something right away for any reason, it should be the responsibility of the BTL to buffer it and to progress on it later. Patrick
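Patrick's alternative, keeping the flow-control state entirely inside the BTL, could look roughly like the sketch below. The types and helper functions are invented for the example and are not the openib BTL's real pending-fragment code.

    /* Rough sketch (invented names): the BTL keeps its own FIFO of fragments
     * it could not put on the wire and drains it from its progress function
     * once send credits return, so the PML never needs to know about the
     * flow-control state. */

    #include <stddef.h>

    struct pending_frag {
        struct pending_frag *next;
        void *descriptor;
    };

    struct btl_endpoint {
        int send_credits;                  /* flow-control credits left     */
        struct pending_frag *pending_head; /* fragments waiting for credits */
        struct pending_frag **pending_tail;
    };

    extern void post_to_nic(struct btl_endpoint *ep, void *descriptor);

    /* Called by btl_send() when there are no credits: remember the fragment. */
    void enqueue_pending(struct btl_endpoint *ep, struct pending_frag *frag)
    {
        frag->next = NULL;
        *ep->pending_tail = frag;
        ep->pending_tail = &frag->next;
    }

    /* Called from the BTL progress function whenever credits come back. */
    void drain_pending(struct btl_endpoint *ep)
    {
        while (ep->pending_head != NULL && ep->send_credits > 0) {
            struct pending_frag *frag = ep->pending_head;
            ep->pending_head = frag->next;
            if (ep->pending_head == NULL)
                ep->pending_tail = &ep->pending_head;
            ep->send_credits--;
            post_to_nic(ep, frag->descriptor);  /* completion callback fires later */
        }
    }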
Re: [OMPI devel] Multiworld MCA parameter values broken
What changed is that we never passed mca params to the orted before - they always went to the app, but it's the orted that has the issue. There is a bug ticket thread on this subject - I forget the number immediately. Basically, the problem was that we cannot generally pass the local environment to the orteds when we launch them. However, people needed various mca params to get to the orteds to control their behavior. The only way to resolve that problem was to pass the params via the command line, which is what was done. Except for a very few cases, all of our mca params are single values that do not include spaces, so this is not a problem that is causing widespread issues. As I said, I already had to deal with one special case that didn't involve spaces, but did have special characters that required quoting, which identified the larger problem of dealing with quoted strings. I have no objection to a more general fix. Like I said in my note, though, the general fix will take a larger effort. If someone is willing to do so, that is fine with me - I was only offering solutions that would fill the interim time as I haven't heard anyone step up to say they would fix it anytime soon. Please feel free to jump in and volunteer! ;-) I'm willing to put the quotes around things if you will fix the mca cmd line parser to cleanly remove them on the other end. Ralph On 11/7/07 5:50 PM, "Tim Prins" wrote: > I'm curious what changed to make this a problem. How were we passing mca param > from the base to the app before, and why did it change? > > I think that options 1 & 2 below are no good, since we, in general, allow > string mca params to have spaces (as far as I understand it). So a more > general approach is needed. > > Tim > > On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote: >> Sorry for delay - wasn't ignoring the issue. >> >> There are several fixes to this problem - ranging in order from least to >> most work: >> >> 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. >> It won't affect anything on the backend because the daemon/procs don't use >> ssh. >> >> 2. include "pls_rsh_agent" in the array of mca params not to be passed to >> the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the >> orte_pls_base_orted_append_basic_args function. This would fix the specific >> problem cited here, but I admit that listing every such param by name would >> get tedious. >> >> 3. we could easily detect that a "problem" character was in the mca param >> value when we add it to the orted's argv, and then put "" around it. The >> problem, however, is that the mca param parser on the far end doesn't >> remove those "" from the resulting string. At least, I spent over a day >> fighting with a problem only to discover that was happening. Could be an >> error in the way I was doing things, or could be a real characteristic of >> the parser. Anyway, we would have to ensure that the parser removes any >> surrounding "" before passing along the param value or this won't work. >> >> Ralph >> >> On 11/5/07 12:10 PM, "Tim Prins" wrote: >>> Hi, >>> >>> Commit 16364 broke things when using multiword mca param values. For >>> instance: >>> >>> mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent >>> "ssh -Y" xterm >>> >>> Will crash and burn, because the value "ssh -Y" is being stored into the >>> argv orted_cmd_line in orterun.c:1506. 
This is then added to the launch >>> command for the orted: >>> >>> /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ; >>> export PATH ; >>> LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ; >>> export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug >>> --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename >>> odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872 >>> --nsreplica >>> "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 >>> :4090 8" >>> --gprreplica >>> "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 >>> :4090 8" >>> -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca >>> mca_base_param_file_path >>> /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/ >>> examp les >>> -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples >>> >>> Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So >>> the quotes have been lost, as we die a horrible death. >>> >>> So we need to add the quotes back in somehow, or pass these options >>> differently. I'm not sure what the best way to fix this. >>> >>> Thanks, >>> >>> Tim > >
Re: [OMPI devel] Multiworld MCA parameter values broken
I'm curious what changed to make this a problem. How were we passing mca param from the base to the app before, and why did it change? I think that options 1 & 2 below are no good, since we, in general, allow string mca params to have spaces (as far as I understand it). So a more general approach is needed. Tim On Wednesday 07 November 2007 10:40:45 am Ralph H Castain wrote: > Sorry for delay - wasn't ignoring the issue. > > There are several fixes to this problem - ranging in order from least to > most work: > > 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. > It won't affect anything on the backend because the daemon/procs don't use > ssh. > > 2. include "pls_rsh_agent" in the array of mca params not to be passed to > the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the > orte_pls_base_orted_append_basic_args function. This would fix the specific > problem cited here, but I admit that listing every such param by name would > get tedious. > > 3. we could easily detect that a "problem" character was in the mca param > value when we add it to the orted's argv, and then put "" around it. The > problem, however, is that the mca param parser on the far end doesn't > remove those "" from the resulting string. At least, I spent over a day > fighting with a problem only to discover that was happening. Could be an > error in the way I was doing things, or could be a real characteristic of > the parser. Anyway, we would have to ensure that the parser removes any > surrounding "" before passing along the param value or this won't work. > > Ralph > > On 11/5/07 12:10 PM, "Tim Prins" wrote: > > Hi, > > > > Commit 16364 broke things when using multiword mca param values. For > > instance: > > > > mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent > > "ssh -Y" xterm > > > > Will crash and burn, because the value "ssh -Y" is being stored into the > > argv orted_cmd_line in orterun.c:1506. This is then added to the launch > > command for the orted: > > > > /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ; > > export PATH ; > > LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ; > > export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug > > --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename > > odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872 > > --nsreplica > > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 > >:4090 8" > > --gprreplica > > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0 > >:4090 8" > > -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca > > mca_base_param_file_path > > /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/ > >examp les > > -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples > > > > Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So > > the quotes have been lost, as we die a horrible death. > > > > So we need to add the quotes back in somehow, or pass these options > > differently. I'm not sure what the best way to fix this. > > > > Thanks, > > > > Tim
Re: [OMPI devel] accessors to context id and message id's
On Nov 6, 2007, at 8:38 AM, Terry Dontje wrote: George Bosilca wrote: If I understand correctly your question, then we don't need any extension. Each request has a unique ID (from PERUSE perspective). However, if I remember correctly this is only half implemented in our PERUSE layer (i.e. it works only for expected requests). Looking at the peruse macros it looks like the unique ID is the base_req address, which I imagine rarely matches between processes. That's a completely different topic. If what you need is a unique ID for each request between processes, in other words, a unique ID for each message, then here is the way to go. Use the same information as the MPI matching logic, i.e. (comm_id, remote, tag) to create an identifier for each message. It will not be unique, as multiple messages can generate the same ID, but you can generate a unique ID per message with easy tricks. The PERUSE standard requires that the ID is unique for each process, and for the lifetime of the request. It does not require that the ID be unique across processes. And this is why we're using the base_req as an ID. george. This should be quite easy to fix, if someone invests a few hours into it. For the context id, a user can always use the c2f function to get the fortran ID (which for Open MPI is the communicator ID). Cool, I didn't realize that. thanks, --td Thanks, george. On Nov 5, 2007, at 8:01 AM, Terry Dontje wrote: Currently in order to do message tracing one either has to rely on some error-prone postprocessing of data or replicating some MPI internals up in the PMPI layer. It would help Sun's tools group (and I believe U Dresden also) if Open MPI would create a couple of APIs that exposed the following: 1. PML Message ids used for a request 2. Context id for a specific communicator I could see a couple of ways of providing this information. Either by extending the PERUSE probes or creating actual functions that one would pass in a request handle or communicator handle to get the appropriate data back. This is just a thought right now, which is why this email is not in an RFC format. I wanted to get a feel from the community as to the interest in such APIs and if anyone may have specific issues with us providing such interfaces. If the responses seem positive I will follow this message up with an RFC. thanks, --td
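As a purely illustrative version of the "easy trick" George describes, a tool could combine the matching tuple with a locally maintained sequence counter. The field widths and the next_sequence() helper are assumptions made for this sketch; MPI_Comm_c2f() is the only real call used, and the note that the Fortran handle equals the context ID is the Open MPI-specific point made above.

    /* Illustrative tool-side sketch: build a per-message ID out of the same
     * tuple the matching logic uses (communicator, remote rank, tag), plus a
     * sequence number kept by the tool so repeats on the same tuple stay
     * distinguishable.  Field widths are arbitrary; next_sequence() is a
     * hypothetical helper, not an existing interface. */

    #include <mpi.h>
    #include <stdint.h>

    extern uint16_t next_sequence(MPI_Fint comm_id, int peer, int tag);

    uint64_t make_message_id(MPI_Comm comm, int peer, int tag)
    {
        /* For Open MPI the Fortran handle is the communicator's context ID. */
        MPI_Fint comm_id = MPI_Comm_c2f(comm);

        uint64_t id = 0;
        id |= (uint64_t)(comm_id & 0xffff) << 48;   /* communicator         */
        id |= (uint64_t)(peer    & 0xffff) << 32;   /* remote (source/dest) */
        id |= (uint64_t)(tag     & 0xffff) << 16;   /* tag                  */
        id |= next_sequence(comm_id, peer, tag);    /* disambiguate repeats */
        return id;
    }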
Re: [OMPI devel] collective problems
This is not a problem in the current code base. Remember that this is all in the context of Galen's proposal for btl_send() to be able to return NOT_ON_WIRE -- meaning that the send was successful, but it has not yet been sent (e.g., openib BTL buffered it because it ran out of credits). Read these two messages again to get the context: http://www.open-mpi.org/community/lists/devel/2007/10/2486.php http://www.open-mpi.org/community/lists/devel/2007/10/2487.php Gleb describes the recursive problem (paired with the concept of NOT_ON_WIRE) nicely in his post. Make sense? On Nov 7, 2007, at 1:16 PM, George Bosilca wrote: On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote: The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, when in the "normal case" it will get called from the first level of the recursion. Or maybe I miss something here ... Right -- it's not the callback that is the problem. It's when the recursion is unwound and further up the stack you now have a stale request. That's exactly the point that I fail to see. If the request is freed in the PML callback, then it should get release in both cases, and therefore lead to problems all the time. Which, obviously, is not true when we do not have this deep recursion thing going on. Moreover, he request management is based on the reference count. The PML level have one ref count and the MPI level have another one. In fact, we cannot release a request until we explicitly call ompi_request_free on it. The place where this call happens is different between the blocking and non blocking calls. In the non blocking case the ompi_request_free get called from the *_test (*_wait) functions while in the blocking case it get called directly from the MPI_Send function. Let me summarize: a request cannot reach a stale state without a call to ompi_request_free. This function is never called directly from the PML level. Therefore, the recursion depth should not have any impact on the state of the request ! Is there a simple test case I can run in order to trigger this strange behavior ? Thanks, george. george. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. 
Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. The problem is that because we allow BTL to call opal_progress() internally t
Re: [OMPI devel] collective problems
On Nov 7, 2007, at 12:51 PM, Jeff Squyres wrote: The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, when in the "normal case" it will get called from the first level of the recursion. Or maybe I miss something here ... Right -- it's not the callback that is the problem. It's when the recursion is unwound and further up the stack you now have a stale request. That's exactly the point that I fail to see. If the request is freed in the PML callback, then it should get release in both cases, and therefore lead to problems all the time. Which, obviously, is not true when we do not have this deep recursion thing going on. Moreover, he request management is based on the reference count. The PML level have one ref count and the MPI level have another one. In fact, we cannot release a request until we explicitly call ompi_request_free on it. The place where this call happens is different between the blocking and non blocking calls. In the non blocking case the ompi_request_free get called from the *_test (*_wait) functions while in the blocking case it get called directly from the MPI_Send function. Let me summarize: a request cannot reach a stale state without a call to ompi_request_free. This function is never called directly from the PML level. Therefore, the recursion depth should not have any impact on the state of the request ! Is there a simple test case I can run in order to trigger this strange behavior ? Thanks, george. george. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. 
This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. The problem is that because we allow BTL to call opal_progress() internally the request may be already completed on MPI and MPL levels and freed before return from the call to btl_send(). I did a code review to see how hard it will be to get rid of recursion in Open MPI and I think this is doable. We have to disallow calling progress() (or other functions that may call progress() internally) from BTL and from ULP callbacks that are called by BTL. There is no much places that break this law. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if they can grow without limit and this is the most common use of FREE_LIST_WAIT() so they may be safely changed to FREE_LIST_GET(). After we will solve re
Re: [OMPI devel] collective problems
On Nov 7, 2007, at 12:29 PM, George Bosilca wrote: I finally talked with Galen and Don about this issue in depth. Our understanding is that the "request may get freed before recursion unwinds" issue is *only* a problem within the context of a single MPI call (e.g., MPI_SEND). Is that right? I wonder how this happens ? Specifically, if in an MPI_SEND, the BTL ends up buffering the message and setting early completion, but then recurses into opal_progress() and ends up sending the message and freeing the request during the recursion, then when the recursion unwinds, the original caller will have a stale request. The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, when in the "normal case" it will get called from the first level of the recursion. Or maybe I miss something here ... Right -- it's not the callback that is the problem. It's when the recursion is unwound and further up the stack you now have a stale request. george. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. 
The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. The problem is that because we allow BTL to call opal_progress() internally the request may be already completed on MPI and MPL levels and freed before return from the call to btl_send(). I did a code review to see how hard it will be to get rid of recursion in Open MPI and I think this is doable. We have to disallow calling progress() (or other functions that may call progress() internally) from BTL and from ULP callbacks that are called by BTL. There is no much places that break this law. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if they can grow without limit and this is the most common use of FREE_LIST_WAIT() so they may be safely changed to FREE_LIST_GET(). After we will solve recursion problem the fix to the problem will be a couple of lines of code. - Galen On 10/11/07 11:26 AM, "Gleb Natapov" wrote: On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote: David -- Gleb and I just actively re-looked at this problem yesterday; we think it's related to https://svn.open-mpi.org/trac/ompi/ticket/ 1015. We previously thought this ticket was a different problem, but our analysis yesterday shows that it could be a real problem in the openib BTL or ob1 PML (kinda think it's the openib
Re: [OMPI devel] collective problems
On Nov 7, 2007, at 11:06 AM, Jeff Squyres wrote: Gleb -- I finally talked with Galen and Don about this issue in depth. Our understanding is that the "request may get freed before recursion unwinds" issue is *only* a problem within the context of a single MPI call (e.g., MPI_SEND). Is that right? I wonder how this happens ? Specifically, if in an MPI_SEND, the BTL ends up buffering the message and setting early completion, but then recurses into opal_progress() and ends up sending the message and freeing the request during the recursion, then when the recursion unwinds, the original caller will have a stale request. The same callback is called in both cases. In the case that you described, the callback is called just a little bit deeper into the recursion, when in the "normal case" it will get called from the first level of the recursion. Or maybe I miss something here ... george. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. 
The problem is that because we allow BTL to call opal_progress() internally the request may be already completed on MPI and MPL levels and freed before return from the call to btl_send(). I did a code review to see how hard it will be to get rid of recursion in Open MPI and I think this is doable. We have to disallow calling progress() (or other functions that may call progress() internally) from BTL and from ULP callbacks that are called by BTL. There is no much places that break this law. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if they can grow without limit and this is the most common use of FREE_LIST_WAIT() so they may be safely changed to FREE_LIST_GET(). After we will solve recursion problem the fix to the problem will be a couple of lines of code. - Galen On 10/11/07 11:26 AM, "Gleb Natapov" wrote: On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote: David -- Gleb and I just actively re-looked at this problem yesterday; we think it's related to https://svn.open-mpi.org/trac/ompi/ticket/ 1015. We previously thought this ticket was a different problem, but our analysis yesterday shows that it could be a real problem in the openib BTL or ob1 PML (kinda think it's the openib btl because it doesn't seem to happen on other networks, but who knows...). Gleb is investigating. Here is the result of the investigation. The problem is di
[OMPI devel] Incorrect one-sided test
Hi all - Lisa Glendenning, who's working on a Portals one-sided component, discovered that the test onesided/test_start1.c in our repository is incorrect. It assumes that MPI_Win_start is non-blocking, but the standard says that "MPI_WIN_START is allowed to block until the corresponding MPI_WIN_POST calls are executed". The pt2pt and rdma components did not block, so the test error did not show up with those components. I've fixed the test in r1223, but thought I'd let everyone know I changed one of our conformance tests. Brian
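For reference, a minimal sketch of a general active-target exchange that remains correct under the standard's wording; this is not the actual test file, just the pattern the fixed test has to follow.

    /* Minimal sketch of general active-target synchronization that stays
     * correct even when MPI_Win_start blocks until the matching MPI_Win_post
     * has executed.  Not the actual onesided/test_start1.c. */

    #include <mpi.h>

    void exchange(MPI_Win win, MPI_Group peer_group, int peer, int rank, int *buf)
    {
        if (rank == 0) {
            /* Origin: this call is allowed to block until the target posts. */
            MPI_Win_start(peer_group, 0, win);
            MPI_Put(buf, 1, MPI_INT, peer, 0, 1, MPI_INT, win);
            MPI_Win_complete(win);
        } else {
            /* Target: expose the window; nothing on the origin side may be
             * assumed to happen before this post is executed. */
            MPI_Win_post(peer_group, 0, win);
            MPI_Win_wait(win);
        }
    }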
Re: [OMPI devel] v1.2 branch mpi_preconnect_all
Don, Galen, and I talked about this in depth on the phone today and think that it is a symptom of the same issue discussed in this thread: http://www.open-mpi.org/community/lists/devel/2007/10/2382.php Note my message in that thread from just a few minutes ago: http://www.open-mpi.org/community/lists/devel/2007/11/2561.php We think that the proposed solution to that thread will also fix the mpi_preconnect_all issues (i.e., the ping-pong that Don proposes in his mail should not be necessary). On Oct 17, 2007, at 10:54 AM, Don Kerr wrote: All, I have noticed an issue in the 1.2 branch when mpi_preconnect_all=1. The one way communication pattern (ranks either send or receive from each other) may not fully establish connection with peers. Example, if I have a 3 process mpi job and rank 0 does not do any mpi communication after MPI_Init() the other ranks attempts to connect will not be progressed (I have seen this with tcp and udapl). The preconnect pattern has changed slightly in the trunk but essentially it is still a one way communication, either send or receive with each rank. So although the issue I see in the 1.2 branch does not appear in the trunk I wonder if this will show up again. An alternative to the preconnect pattern that comes to mind would be to perform a send and receive between all ranks to ensure that connections have been fully established. Does anyone have thoughts or comments on this, or reasons not to have all ranks send and receive from all? -DON ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems
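For context, the symmetric exchange Don suggests would look something like the sketch below: every rank posts a receive from and a send to every peer, so both directions of each connection get exercised. As noted above, the expectation is that the NOT_ON_WIRE work makes this unnecessary.

    /* Sketch of a symmetric pre-connect exchange: every rank both sends to
     * and receives from every other rank, forcing each connection to be
     * fully established.  Illustrative only; not the actual preconnect code. */

    #include <mpi.h>
    #include <stdlib.h>

    void preconnect_all(MPI_Comm comm)
    {
        int rank, size, peer, n = 0;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        char *sbuf = calloc((size_t)size, 1);
        char *rbuf = calloc((size_t)size, 1);
        MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(*reqs));

        for (peer = 0; peer < size; peer++) {
            if (peer == rank)
                continue;
            MPI_Irecv(&rbuf[peer], 1, MPI_CHAR, peer, 0, comm, &reqs[n++]);
            MPI_Isend(&sbuf[peer], 1, MPI_CHAR, peer, 0, comm, &reqs[n++]);
        }
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);

        free(reqs);
        free(rbuf);
        free(sbuf);
    }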
Re: [OMPI devel] collective problems
Gleb -- I finally talked with Galen and Don about this issue in depth. Our understanding is that the "request may get freed before recursion unwinds" issue is *only* a problem within the context of a single MPI call (e.g., MPI_SEND). Is that right? Specifically, if in an MPI_SEND, the BTL ends up buffering the message and setting early completion, but then recurses into opal_progress() and ends up sending the message and freeing the request during the recursion, then when the recursion unwinds, the original caller will have a stale request. This is *only* a problem for requests that are involved from the current top-level MPI call. Request from prior calls to MPI functions (e.g., a request from a prior call to MPI_ISEND) are ok because a) we've already done the Right Things to ensure the safety of that request, and b) that request is not on the recursive stack anywhere to become stale as the recursion unwinds. Right? If so, Galen proposes the following: 1. in conjunction with the NOT_ON_WIRE proposal... 2. make a new PML request flag DONT_FREE_ME (or some better name :-) ). 3. blocking MPI_SEND/MPI_RECV calls will set this flag (or, more specifically, the top of the PML calls for blocking send/receive) right when the request is allocated (i.e., before calling btl_send()). 4. when the PML is called for completion on this request, it will do all the stuff that it needs to effect completion -- but then it will see the DONT_FREE_ME flag and not actually free the request. Obviously, if DONT_FREE_ME is *not* set, then the PML does what it does today: it frees the request. 5. the top-level PML call will eventually complete: 5a. For blocking PML calls (e.g., corresponding to MPI_SEND and MPI_RECV), the request can be unconditionally freed. 5b. For non-blocking PML calls (e.g., corresponding to MPI_ISEND), only free the request if it was completed. Note that with this scheme, it becomes irrelevant as to whether the PML completion call is invoked on the first descent into the BTL or recursively via opal_progress. How does that sound? If that all works, it might be beneficial to put this back to the 1.2 branch because there are definitely apps that would benefit from it. On Oct 23, 2007, at 10:19 AM, Gleb Natapov wrote: So this problem goes WAY back.. The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user level flow control). One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment and then MPI_WAITALL and others will do there job properly. I even implemented this once, but there is a problem. Currently we mark request as completed on MPI level and then do btl_send(). Whenever IB completion will happen the request will be marked as complete on PML level and freed. The fix requires to change the order like this: Call btl_send(), check return value from BTL and mark request complete as necessary. The problem is that because we allow BTL to call opal_progress() internally the request may be already completed on MPI and MPL levels and freed before return from the call to btl_send(). I did a code review to see how hard it will be to get rid of recursion in Open MPI and I think this is doable. 
We have to disallow calling progress() (or other functions that may call progress() internally) from BTL and from ULP callbacks that are called by BTL. There is no much places that break this law. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if they can grow without limit and this is the most common use of FREE_LIST_WAIT() so they may be safely changed to FREE_LIST_GET(). After we will solve recursion problem the fix to the problem will be a couple of lines of code. - Galen On 10/11/07 11:26 AM, "Gleb Natapov" wrote: On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote: David -- Gleb and I just actively re-looked at this problem yesterday; we think it's related to https://svn.open-mpi.org/trac/ompi/ticket/ 1015. We previously thought this ticket was a different problem, but our analysis yesterday shows that it could be a real problem in the openib BTL or ob1 PML (kinda think it's the openib btl because it doesn't seem to happen on other networks, but who knows...). Gleb is investigating. Here is the result of the investigation. The problem is different than #1015 ticket. What we have here is one rank calls isend() of a small message and wait_all() in a loop and another one calls irecv(). The problem is that isend() usually doesn't call opal_progress() anywhere and wait_all() doesn't call progress if all requests are already c
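A bare-bones illustration of the flag scheme in steps 2-5 might look like the following. Every identifier here is made up for the sketch, and the malloc/free lifetime is a simplification of the real reference-counted requests; it only shows the intended control flow.

    /* Hypothetical illustration of the DONT_FREE_ME proposal (steps 2-5).
     * Names and the malloc/free lifetime are simplifications; the real PML
     * requests are reference-counted objects. */

    #include <stdbool.h>
    #include <stdlib.h>

    struct pml_request {
        bool dont_free_me;   /* step 3: set by blocking MPI_SEND/MPI_RECV  */
        bool completed;      /* set when the BTL reports local completion  */
    };

    /* Step 4: the completion path checks the flag before freeing. */
    void pml_completion_callback(struct pml_request *req)
    {
        req->completed = true;
        if (!req->dont_free_me)
            free(req);                 /* non-blocking path: free as today */
    }

    /* Steps 3 and 5a: a blocking send owns the request until it returns, so
     * a recursive trip through opal_progress() can no longer hand the caller
     * back a stale request. */
    int pml_blocking_send(void)
    {
        struct pml_request *req = calloc(1, sizeof(*req));
        req->dont_free_me = true;      /* set before any btl_send() call */

        /* ... btl_send(), then progress until req->completed ... */

        free(req);                     /* step 5a: unconditionally freed */
        return 0;
    }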
Re: [OMPI devel] Multiworld MCA parameter values broken
Sorry for delay - wasn't ignoring the issue. There are several fixes to this problem - ranging in order from least to most work: 1. just alias "ssh" to be "ssh -Y" and run without setting the mca param. It won't affect anything on the backend because the daemon/procs don't use ssh. 2. include "pls_rsh_agent" in the array of mca params not to be passed to the orted in orte/mca/pls/base/pls_base_general_support_fns.c, the orte_pls_base_orted_append_basic_args function. This would fix the specific problem cited here, but I admit that listing every such param by name would get tedious. 3. we could easily detect that a "problem" character was in the mca param value when we add it to the orted's argv, and then put "" around it. The problem, however, is that the mca param parser on the far end doesn't remove those "" from the resulting string. At least, I spent over a day fighting with a problem only to discover that was happening. Could be an error in the way I was doing things, or could be a real characteristic of the parser. Anyway, we would have to ensure that the parser removes any surrounding "" before passing along the param value or this won't work. Ralph On 11/5/07 12:10 PM, "Tim Prins" wrote: > Hi, > > Commit 16364 broke things when using multiword mca param values. For > instance: > > mpirun --debug-daemons -mca orte_debug 1 -mca pls rsh -mca pls_rsh_agent > "ssh -Y" xterm > > Will crash and burn, because the value "ssh -Y" is being stored into the > argv orted_cmd_line in orterun.c:1506. This is then added to the launch > command for the orted: > > /usr/bin/ssh -Y odin004 PATH=/san/homedirs/tprins/usr/rsl/bin:$PATH ; > export PATH ; > LD_LIBRARY_PATH=/san/homedirs/tprins/usr/rsl/lib:$LD_LIBRARY_PATH ; > export LD_LIBRARY_PATH ; /san/homedirs/tprins/usr/rsl/bin/orted --debug > --debug-daemons --name 0.1 --num_procs 2 --vpid_start 0 --nodename > odin004 --universe tpr...@odin.cs.indiana.edu:default-universe-27872 > --nsreplica > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0:4090 > 8" > --gprreplica > "0.0;tcp://129.79.240.100:40907;tcp6://2001:18e8:2:240:2e0:81ff:fe2d:21a0:4090 > 8" > -mca orte_debug 1 -mca pls_rsh_agent ssh -Y -mca > mca_base_param_file_path > /u/tprins/usr/rsl/share/openmpi/amca-param-sets:/san/homedirs/tprins/rsl/examp > les > -mca mca_base_param_file_path_force /san/homedirs/tprins/rsl/examples > > Notice that in this command we now have "-mca pls_rsh_agent ssh -Y". So > the quotes have been lost, as we die a horrible death. > > So we need to add the quotes back in somehow, or pass these options > differently. I'm not sure what the best way to fix this. > > Thanks, > > Tim
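Option 3 boils down to two small pieces, sketched here with placeholder helper names (this is not the actual orterun or mca_base parser code): quote a value that contains whitespace when appending it to the orted command line, and strip one matching pair of quotes on the receiving side before the value is stored.

    /* Sketch of option 3 with made-up helper names.  Launcher side: quote a
     * value only when it needs it; daemon side: strip one surrounding pair
     * of quotes before the value is stored. */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    char *quote_mca_value(const char *value)
    {
        char *out;
        if (strpbrk(value, " \t") == NULL)
            return strdup(value);             /* common case: nothing to do */
        out = malloc(strlen(value) + 3);
        sprintf(out, "\"%s\"", value);        /* ssh -Y  becomes  "ssh -Y" */
        return out;
    }

    char *unquote_mca_value(const char *value)
    {
        size_t len = strlen(value);
        if (len >= 2 && value[0] == '"' && value[len - 1] == '"') {
            char *out = malloc(len - 1);
            memcpy(out, value + 1, len - 2);
            out[len - 2] = '\0';
            return out;
        }
        return strdup(value);
    }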
[OMPI devel] carto framework requirements
Hi, I wrote some software requirements for the carto framework. Please review this and post comments. Thanks. Sharon. (Attachment: carto_framework_requirements.pdf)