Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
Hi George et al I have begun documenting the RecoS operation on the OMPI wiki: https://svn.open-mpi.org/trac/ompi/wiki/RecoS I'll continue to work on this over the next few days by adding a section explaining what was changed outside of the new framework to make it all work. In addition, I am revising the recos.h API documentation. Hope to have all that done over the weekend. On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote: > > On Feb 23, 2010, at 3:32 PM, George Bosilca wrote: > >> Ralph, Josh, >> >> We have some comments about the API of the new framework, mostly >> clarifications needed to better understand how this new framework is >> supposed to be used. And a request for a deadline extension, to delay the >> code merge from the Recos branch in the trunk by a week. >> >> We have our own FT branch, with a totally different approach than what is >> described in your RFC. Unfortunately, it diverged from the trunk about a >> year ago, and merging back had proven to be a quite difficult task. Some of >> the functionality in the Recos framework is clearly beneficial for what we >> did, and has the potential to facilitate the porting of most of the features >> from our brach back in trunk. We would like the deadline extension in order >> to deeply analyze the impact of the Recos framework on our work, and see how >> we can fit everything together back in the trunk of Open MPI. > > No problem with the extension - feel free to suggest modifications to make > the merge easier. This is by no means cast in stone, but rather a starting > point. > >> >> Here are some comments about the code: >> >> 1. The documentation in recos.h is not very clear. Most of the functions use >> only IN arguments, and are not supposed to return any values. We don't see >> how the functions are supposed to be used, and what is supposed to be their >> impact on the ORTE framework data. > > I'll try to clarify the comments tonight (I know Josh is occupied right now). > The recos APIs are called from two locations: > > 1. The errmgr calls recos whenever it receives a report of an aborted process > (via the errmgr.proc_aborted API). The idea was for recos to determine what > (if anything) to do about the failed process. > > 2. The rmaps modules can call the recos "suggest_map_targets" API to get a > list of suggested nodes for the process that is to be restarted. At the > moment, only the resilient mapper module does this. However, Josh and I are > looking at reorganizing some functionality currently in that mapper module > and making all of the existing mappers be "resilient". > > So basically, the recos modules determine the recovery procedure and execute > it. For example, in the "orcm" module, we actually update the various > proc/job objects to prep them for restart and call plm.spawn from within that > module. If instead you use the ignore module, it falls through to the recos > base functions which call "abort" to kill the job. Again, the action is taken > local to recos, so nothing need be returned. > > The functions generally don't return values (other than success/error) > because we couldn't think of anything useful to return to the errmgr. > Whatever recos does about an aborted proc, the errmgr doesn't do anything > further - if you look in that code, you'll see that if recos is enabled, all > the errmgr does is call recos and return. > > Again, this can be changed if desired. > >> >> 2. Why do we have all the char***? Why are they only declared as IN >> arguments? 
> > I take it you mean in the predicted fault API? I believe Josh was including > that strictly as a placeholder. As you undoubtedly recall, I removed the fddp > framework from the trunk (devel continues off-line), so Josh wasn't sure what > I might want to input here. If you look at the modules themselves, you will > see the implementation is essentially empty at this time. > > We had discussed simply removing that API for now until we determined if/when > fault prediction would return to the OMPI trunk. It was kind of a tossup - so > we left if for now. Could just as easily be removed until a later date - > either way is fine with us. > >> >> 3. The orte_recos_base_process_fault_fn_t function use the node_list as an >> IN/OUT argument. Why? If the list is modified, then we have a scalability >> problem, as the list will have to be rebuilt before each call. > > Looking...looking...hmm. > > typedef int (*orte_recos_base_process_fault_fn_t) > (orte_job_t *jdata, orte_process_name_t *proec_name, orte_proc_state_t > state, int *stack_state); > > There is no node list, or list of any type, going in or out of that function. > I suspect you meant the one below it: > > typedef int (*orte_recos_base_suggest_map_targets_fn_t) > (orte_proc_t *proc, orte_node_t *oldnode, opal_list_t *node_list); > > I concur with your concern about scalability here. However, I believe the > idea was that we
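A minimal sketch of the calling pattern described above: the errmgr delegates an aborted-proc report to the selected recos module and simply returns, while the module decides locally what to do. The types and names below are simplified, hypothetical stand-ins (not code from the branch); the real interface lives in recos.h.

/* Toy illustration only -- hypothetical, simplified types. */
#include <stdio.h>

typedef struct { int jobid; int vpid; } toy_proc_name_t;
typedef enum { TOY_PROC_RUNNING, TOY_PROC_ABORTED } toy_proc_state_t;

/* A recos module publishes function pointers, selected like any MCA module. */
typedef struct {
    /* decide what (if anything) to do about a failed process */
    int (*process_fault)(toy_proc_name_t *proc, toy_proc_state_t state);
    /* suggest nodes on which a to-be-restarted process could be mapped */
    int (*suggest_map_targets)(toy_proc_name_t *proc, const char *oldnode,
                               const char **candidates, int ncandidates);
} toy_recos_module_t;

/* "ignore"-style behaviour: fall through and abort the job. */
static int ignore_process_fault(toy_proc_name_t *proc, toy_proc_state_t state)
{
    (void)state;
    printf("recos(ignore): proc %d.%d failed -> abort the job\n",
           proc->jobid, proc->vpid);
    return 0;
}

static toy_recos_module_t selected = { ignore_process_fault, NULL };

/* Call site 1: the errmgr hands the failure to recos and returns;
 * whatever recovery happens is local to the recos module. */
static void errmgr_proc_aborted(toy_proc_name_t *proc)
{
    if (NULL != selected.process_fault) {
        selected.process_fault(proc, TOY_PROC_ABORTED);
    }
}

int main(void)
{
    toy_proc_name_t failed = { 1, 3 };
    errmgr_proc_aborted(&failed);   /* mimics errmgr.proc_aborted */
    return 0;
}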
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
Hi Ralph, The "composite framework" idea is very interesting. Regarding the schema represented by the picture, I didn't understand RecoS' behaviour in a node-failure situation. In this case, will mpirun consider the daemon failure as a normal proc failure? If so, should mpirun update the global proc state for all jobs running under the failed daemon? Best regards, Leonardo On Feb 25, 2010, at 7:05 AM, Ralph Castain wrote: > Hi George et al > > I have begun documenting the RecoS operation on the OMPI wiki: > > https://svn.open-mpi.org/trac/ompi/wiki/RecoS > > I'll continue to work on this over the next few days by adding a section > explaining what was changed outside of the new framework to make it all work. > In addition, I am revising the recos.h API documentation. > > Hope to have all that done over the weekend. > > > On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote: > >> >> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote: >> >>> Ralph, Josh, >>> >>> We have some comments about the API of the new framework, mostly >>> clarifications needed to better understand how this new framework is >>> supposed to be used. And a request for a deadline extension, to delay the >>> code merge from the Recos branch in the trunk by a week. >>> >>> We have our own FT branch, with a totally different approach than what is >>> described in your RFC. Unfortunately, it diverged from the trunk about a >>> year ago, and merging back had proven to be a quite difficult task. Some of >>> the functionality in the Recos framework is clearly beneficial for what we >>> did, and has the potential to facilitate the porting of most of the >>> features from our brach back in trunk. We would like the deadline extension >>> in order to deeply analyze the impact of the Recos framework on our work, >>> and see how we can fit everything together back in the trunk of Open MPI. >> >> No problem with the extension - feel free to suggest modifications to make >> the merge easier. This is by no means cast in stone, but rather a starting >> point. >> >>> >>> Here are some comments about the code: >>> >>> 1. The documentation in recos.h is not very clear. Most of the functions >>> use only IN arguments, and are not supposed to return any values. We don't >>> see how the functions are supposed to be used, and what is supposed to be >>> their impact on the ORTE framework data. >> >> I'll try to clarify the comments tonight (I know Josh is occupied right >> now). The recos APIs are called from two locations: >> >> 1. The errmgr calls recos whenever it receives a report of an aborted >> process (via the errmgr.proc_aborted API). The idea was for recos to >> determine what (if anything) to do about the failed process. >> >> 2. The rmaps modules can call the recos "suggest_map_targets" API to get a >> list of suggested nodes for the process that is to be restarted. At the >> moment, only the resilient mapper module does this. However, Josh and I are >> looking at reorganizing some functionality currently in that mapper module >> and making all of the existing mappers be "resilient". >> >> So basically, the recos modules determine the recovery procedure and execute >> it. For example, in the "orcm" module, we actually update the various >> proc/job objects to prep them for restart and call plm.spawn from within >> that module. If instead you use the ignore module, it falls through to the >> recos base functions which call "abort" to kill the job. Again, the action >> is taken local to recos, so nothing need be returned.
>> >> The functions generally don't return values (other than success/error) >> because we couldn't think of anything useful to return to the errmgr. >> Whatever recos does about an aborted proc, the errmgr doesn't do anything >> further - if you look in that code, you'll see that if recos is enabled, all >> the errmgr does is call recos and return. >> >> Again, this can be changed if desired. >> >>> >>> 2. Why do we have all the char***? Why are they only declared as IN >>> arguments? >> >> I take it you mean in the predicted fault API? I believe Josh was including >> that strictly as a placeholder. As you undoubtedly recall, I removed the >> fddp framework from the trunk (devel continues off-line), so Josh wasn't >> sure what I might want to input here. If you look at the modules themselves, >> you will see the implementation is essentially empty at this time. >> >> We had discussed simply removing that API for now until we determined >> if/when fault prediction would return to the OMPI trunk. It was kind of a >> tossup - so we left if for now. Could just as easily be removed until a >> later date - either way is fine with us. >> >>> >>> 3. The orte_recos_base_process_fault_fn_t function use the node_list as an >>> IN/OUT argument. Why? If the list is modified, then we have a scalability >>> problem, as the list will have to be rebuilt before
Re: [OMPI devel] what's the relationship between proc, endpoint and btl?
Thanks a lot! I got it. Could you point me to some more material that would help me better understand the following functions: (1):/ompi/mca/pml/ob1/pml_ob1.c/mca_pml_ob1_add_procs (2):/ompi/mca/bml/r2/bml_r2.c/mca_bml_r2_add_procs (3):/ompi/mca/btl/tcp/btl_tcp.c/mca_btl_tcp_add_procs, especially the second function; these functions are really hard to understand completely. Thanks & Regards Yaohui Hu On Thu, Feb 25, 2010 at 10:34 AM, Jeff Squyres wrote: > On Feb 24, 2010, at 12:16 PM, Aurélien Bouteiller wrote: > > > btl is the component responsible for a particular type of fabric. > Endpoint is somewhat the instantiation of a btl to reach a particular > destination on a particular fabric, proc is the generic name and properties > of a destination. > > A few more words here... > > btl = Byte Transfer Layer. It's our name for the framework that governs > one flavor of point-to-point communications in the MPI layer. Components in > this framework are used by the ob1 and csum PMLs to effect MPI > point-to-point communications (they're used in other ways, too, but let's > start at the beginning here...). There are several btl components: tcp, sm > (shared memory), self (process loopback), openib (OpenFabrics), ...etc. > Each one of these effects communications over a different network type. > For purposes of this discussion, "component" == "plugin". > > The btl plugin is loaded into an MPI process and its component open/query > functions are called. If the btl component determines that it wants to run, > it returns one or more modules. Typically, btls return a module for every > interface that they find. For example, if the openib module finds 2 > OpenFabrics device ports, it'll return 2 modules. > > Hence, we typically describe components as analogous to a C++ class; > modules are analogous to instances of that C++ class. > > Note that in many BTL component comments and variables/fields, they > typically use shorthand language such as, "The btl then does this..." Such > language almost always refers to a specific module of that btl component. > > Modules are marshalled by the bml and ob1/csum to make an ordered list of > who can talk to whom. > > Endpoints are data structures used to represent a module's connection to a > remote MPI process (proc). Hence, a BTL component can create multiple > modules; each module can create lots of endpoints. Each endpoint is tied to > a specific remote proc. > > > Aurelien > > > > Le 24 févr. 2010 à 09:59, hu yaohui a écrit : > > > > > Could someone tell me the relationship between proc,endpoint and btl? > > > thanks & regards > > > ___ > > > devel mailing list > > > de...@open-mpi.org > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > > > > ___ > > devel mailing list > > de...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >
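A toy model of the component/module/endpoint relationship described above, with hypothetical names (the real declarations are the mca_btl_base_* types in ompi/mca/btl/btl.h): one component per fabric type, one module per discovered interface, one endpoint per reachable remote proc.

/* Editor's toy model, hypothetical names -- deliberately simpler than btl.h. */
#include <stdio.h>

typedef struct { int jobid; int vpid; } toy_proc_t;           /* a remote peer  */

typedef struct {                                              /* one connection */
    toy_proc_t *proc;           /* the specific remote proc this endpoint reaches */
    int         connected;
} toy_endpoint_t;

typedef struct {                                              /* one "instance" */
    const char      *if_name;   /* e.g. one NIC/port found at open/query time     */
    toy_endpoint_t **endpoints; /* one endpoint per reachable remote proc         */
    size_t           nendpoints;
} toy_module_t;

typedef struct {                                              /* the "class"    */
    const char   *name;         /* e.g. "tcp", "sm", "openib"                     */
    toy_module_t *modules;      /* query returns one module per usable interface  */
    size_t        nmodules;
} toy_component_t;

int main(void)
{
    toy_proc_t peers[2] = { { 1, 0 }, { 1, 1 } };
    toy_endpoint_t e0 = { &peers[0], 0 }, e1 = { &peers[1], 0 };
    toy_endpoint_t *eps[2] = { &e0, &e1 };
    toy_module_t mod = { "eth0", eps, 2 };        /* one module per interface */
    toy_component_t tcp = { "tcp", &mod, 1 };     /* one component per fabric */

    printf("component %s: %zu module(s); module on %s reaches %zu proc(s)\n",
           tcp.name, tcp.nmodules, mod.if_name, mod.nendpoints);
    return 0;
}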
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote: > Hi Ralph, > > Very interesting the "composite framework" idea. Josh is the force behind that idea :-) > Regarding to the schema represented by the picture, I didn't understand the > RecoS' behaviour in a node failure situation. > > In this case, will mpirun consider the daemon failure as a normal proc > failure? If it is correct, should mpirun update the global procs state for > all jobs running under the failed daemon? I haven't included the node failure case yet - still on my "to-do" list. In brief, the answer is yes/no. :-) Daemon failure follows the same code path as shown in the flow chart. However, it is up to the individual modules to determine a response to that failure. The "orcm" RecoS module response is to (a) mark all procs on that node as having failed, (b) mark that node as "down" so it won't get reused, and (c) remap and restart all such procs on the remaining available nodes, starting new daemon(s) as required. In the orcm environment, nodes that are replaced or rebooted automatically start their own daemon. This is detected by orcm, and the node state (if the node is rebooted) will automatically be updated to "up" - if it is a new node, it is automatically added to the available resources. This allows the node to be reused once the problem has been corrected. In other environments (ssh, slurm, etc), the node is simply left as "down" as there is no way to know if/when the node becomes available again. If you aren't using the "orcm" module, then the default behavior will abort the job. > > Best regards, > Leonardo > > On Feb 25, 2010, at 7:05 AM, Ralph Castain wrote: > >> Hi George et al >> >> I have begun documenting the RecoS operation on the OMPI wiki: >> >> https://svn.open-mpi.org/trac/ompi/wiki/RecoS >> >> I'll continue to work on this over the next few days by adding a section >> explaining what was changed outside of the new framework to make it all >> work. In addition, I am revising the recos.h API documentation. >> >> Hope to have all that done over the weekend. >> >> >> On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote: >> >>> >>> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote: >>> Ralph, Josh, We have some comments about the API of the new framework, mostly clarifications needed to better understand how this new framework is supposed to be used. And a request for a deadline extension, to delay the code merge from the Recos branch in the trunk by a week. We have our own FT branch, with a totally different approach than what is described in your RFC. Unfortunately, it diverged from the trunk about a year ago, and merging back had proven to be a quite difficult task. Some of the functionality in the Recos framework is clearly beneficial for what we did, and has the potential to facilitate the porting of most of the features from our brach back in trunk. We would like the deadline extension in order to deeply analyze the impact of the Recos framework on our work, and see how we can fit everything together back in the trunk of Open MPI. >>> >>> No problem with the extension - feel free to suggest modifications to make >>> the merge easier. This is by no means cast in stone, but rather a starting >>> point. >>> Here are some comments about the code: 1. The documentation in recos.h is not very clear. Most of the functions use only IN arguments, and are not supposed to return any values. We don't see how the functions are supposed to be used, and what is supposed to be their impact on the ORTE framework data. 
>>> >>> I'll try to clarify the comments tonight (I know Josh is occupied right >>> now). The recos APIs are called from two locations: >>> >>> 1. The errmgr calls recos whenever it receives a report of an aborted >>> process (via the errmgr.proc_aborted API). The idea was for recos to >>> determine what (if anything) to do about the failed process. >>> >>> 2. The rmaps modules can call the recos "suggest_map_targets" API to get a >>> list of suggested nodes for the process that is to be restarted. At the >>> moment, only the resilient mapper module does this. However, Josh and I are >>> looking at reorganizing some functionality currently in that mapper module >>> and making all of the existing mappers be "resilient". >>> >>> So basically, the recos modules determine the recovery procedure and >>> execute it. For example, in the "orcm" module, we actually update the >>> various proc/job objects to prep them for restart and call plm.spawn from >>> within that module. If instead you use the ignore module, it falls through >>> to the recos base functions which call "abort" to kill the job. Again, the >>> action is taken local to recos, so nothing need be returned. >>> >>> The functions generally don't return values (other than success/
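A rough sketch of the three-step orcm response described in this exchange (mark the node's procs as failed, mark the node down, remap and restart). All types and helpers below are hypothetical stand-ins, not the actual orcm module code.

/* Editor's sketch of an orcm-style response to a daemon/node failure. */
#include <stdio.h>
#include <stddef.h>

typedef enum { NODE_UP, NODE_DOWN } node_state_t;
typedef enum { PROC_RUNNING, PROC_FAILED } proc_state_t;

typedef struct { proc_state_t state; int rank; } toy_proc_t;
typedef struct {
    const char  *name;
    node_state_t state;
    toy_proc_t  *procs;
    size_t       nprocs;
} toy_node_t;

/* hypothetical stand-ins for the mapper and plm.spawn steps */
static void remap_proc_to_available_node(toy_proc_t *p)
{
    printf("remapping rank %d onto a remaining node\n", p->rank);
}

static void spawn_restarted_procs(void)
{
    printf("plm.spawn: restarting remapped procs (new daemons if needed)\n");
}

static void recos_node_failed(toy_node_t *node)
{
    /* (a) mark every proc that was running on the failed node as failed */
    for (size_t i = 0; i < node->nprocs; i++) {
        node->procs[i].state = PROC_FAILED;
    }

    /* (b) mark the node down so the mapper will not reuse it */
    node->state = NODE_DOWN;

    /* (c) remap and restart the failed procs on the remaining nodes */
    for (size_t i = 0; i < node->nprocs; i++) {
        remap_proc_to_available_node(&node->procs[i]);
    }
    spawn_restarted_procs();
}

int main(void)
{
    toy_proc_t procs[2] = { { PROC_RUNNING, 4 }, { PROC_RUNNING, 5 } };
    toy_node_t node = { "node3", NODE_UP, procs, 2 };
    recos_node_failed(&node);   /* daemon on node3 reported as lost */
    return 0;
}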
Re: [OMPI devel] what's the relationship between proc, endpoint and btl?
On Feb 25, 2010, at 7:14 AM, hu yaohui wrote: > Thanks a lot! i got it.Could you introduce some more materials for me to get > better understood of the following functions: > (1):/ompi/mca/pml/ob1/pml_ob1.c/mca_pml_ob1_add_procs This is just the OB1 function to add new peer processes. It's called by the MPI layer -- e.g., during MPI_INIT, MPI_COMM_SPAWN, etc. > (2):/ompi/mca/bml/r2/bml_r2.c/mca_bml_r2_add_procs The BML is the BTL Multiplexing Layer. It's just a multiplexer for marshalling multiple BTL's together. It has no message passing functionality in itself -- it just finds and dispatches to underlying BTL's. > (3):/ompi/mca/btl/tcp/btl_tcp.c/mca_btl_tcp_add_procs Check out the description of the BTL add_procs function in ompi/mca/btl/btl.h. This is the TCP BTL component's add_procs function. Every BTL has one. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
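A simplified sketch of the add_procs layering discussed in this thread: the PML hands new peers down, the BML multiplexes over the available BTL modules, and each BTL decides which peers it can reach and creates endpoints for them. The signatures below are hypothetical simplifications, not the real btl.h prototypes.

/* Editor's sketch of the add_procs chain (PML -> BML -> BTLs), toy types only. */
#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { int vpid; } toy_proc_t;

typedef struct toy_btl {
    const char *name;
    /* decide which of the new procs this module can reach and create an
     * endpoint for each reachable one (endpoints elided in this toy) */
    int (*add_procs)(struct toy_btl *btl, size_t nprocs,
                     toy_proc_t **procs, bool *reachable);
} toy_btl_t;

static int tcp_add_procs(toy_btl_t *btl, size_t n, toy_proc_t **procs,
                         bool *reachable)
{
    for (size_t i = 0; i < n; i++) {
        reachable[i] = true;               /* toy: TCP reaches everyone */
        printf("%s: endpoint created for proc %d\n", btl->name, procs[i]->vpid);
    }
    return 0;
}

/* BML role: no message passing of its own, just dispatch to each BTL module */
static int bml_add_procs(toy_btl_t **btls, size_t nbtls,
                         size_t nprocs, toy_proc_t **procs, bool *reachable)
{
    for (size_t b = 0; b < nbtls; b++) {
        btls[b]->add_procs(btls[b], nprocs, procs, reachable);
    }
    return 0;
}

int main(void)
{
    toy_btl_t tcp = { "btl_tcp", tcp_add_procs };
    toy_btl_t *btls[1] = { &tcp };
    toy_proc_t p0 = { 0 }, p1 = { 1 };
    toy_proc_t *procs[2] = { &p0, &p1 };
    bool reachable[2] = { false, false };

    /* PML role (ob1): called with new peers during MPI_INIT / MPI_COMM_SPAWN */
    bml_add_procs(btls, 1, 2, procs, reachable);
    return 0;
}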
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
On Feb 23, 2010, at 3:00 PM, Ralph Castain wrote: > > On Feb 23, 2010, at 3:32 PM, George Bosilca wrote: > >> Ralph, Josh, >> >> We have some comments about the API of the new framework, mostly >> clarifications needed to better understand how this new framework is >> supposed to be used. And a request for a deadline extension, to delay the >> code merge from the Recos branch in the trunk by a week. >> >> We have our own FT branch, with a totally different approach than what is >> described in your RFC. Unfortunately, it diverged from the trunk about a >> year ago, and merging back had proven to be a quite difficult task. Some of >> the functionality in the Recos framework is clearly beneficial for what we >> did, and has the potential to facilitate the porting of most of the features >> from our brach back in trunk. We would like the deadline extension in order >> to deeply analyze the impact of the Recos framework on our work, and see how >> we can fit everything together back in the trunk of Open MPI. > > No problem with the extension - feel free to suggest modifications to make > the merge easier. This is by no means cast in stone, but rather a starting > point. Additionally, if you wanted to have a teleconf next week to increase the bandwidth of communication we can do that as well. Might help us negotiate some modifications that would be mutually beneficial. Unfortunately I am currently at a conference so cannot call in until Monday. > >> >> Here are some comments about the code: >> >> 1. The documentation in recos.h is not very clear. Most of the functions use >> only IN arguments, and are not supposed to return any values. We don't see >> how the functions are supposed to be used, and what is supposed to be their >> impact on the ORTE framework data. > > I'll try to clarify the comments tonight (I know Josh is occupied right now). > The recos APIs are called from two locations: > > 1. The errmgr calls recos whenever it receives a report of an aborted process > (via the errmgr.proc_aborted API). The idea was for recos to determine what > (if anything) to do about the failed process. > > 2. The rmaps modules can call the recos "suggest_map_targets" API to get a > list of suggested nodes for the process that is to be restarted. At the > moment, only the resilient mapper module does this. However, Josh and I are > looking at reorganizing some functionality currently in that mapper module > and making all of the existing mappers be "resilient". > > So basically, the recos modules determine the recovery procedure and execute > it. For example, in the "orcm" module, we actually update the various > proc/job objects to prep them for restart and call plm.spawn from within that > module. If instead you use the ignore module, it falls through to the recos > base functions which call "abort" to kill the job. Again, the action is taken > local to recos, so nothing need be returned. > > The functions generally don't return values (other than success/error) > because we couldn't think of anything useful to return to the errmgr. > Whatever recos does about an aborted proc, the errmgr doesn't do anything > further - if you look in that code, you'll see that if recos is enabled, all > the errmgr does is call recos and return. > > Again, this can be changed if desired. > >> >> 2. Why do we have all the char***? Why are they only declared as IN >> arguments? > > I take it you mean in the predicted fault API? I believe Josh was including > that strictly as a placeholder. 
As you undoubtedly recall, I removed the fddp > framework from the trunk (devel continues off-line), so Josh wasn't sure what > I might want to input here. If you look at the modules themselves, you will > see the implementation is essentially empty at this time. > > We had discussed simply removing that API for now until we determined if/when > fault prediction would return to the OMPI trunk. It was kind of a tossup - so > we left if for now. Could just as easily be removed until a later date - > either way is fine with us. In this version of the components, none of them use the predicted_fault API. I have at least one component that will come in as a second step (so soon, but a different RFC) that does use this interface to do some super nifty things (if I say so myself :). We can remove the interface if people have heartburn about it being there, but we will want to add it back in soon enough. As for the 'char ***' parameters, they really should just be IN parameters. They are not passed back to the suggestion/detection agent (though I guess they could be). In recognition of some of the broader uses of this interface, I am considering changing them to a list of RecoS-specific structures that would allow the caller of this function to pass additional information for each of the parameters (like an assurance level of the fault - 75% sure this proc has failed). So we would cha
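One possible shape for the "list of RecoS specific structures" mentioned above, strictly as a hypothetical sketch: each entry identifies a suspect process and carries an assurance level (and perhaps a time estimate) instead of a bare string.

/* Editor's sketch of a possible fault-report entry -- hypothetical names. */
#include <stdio.h>
#include <stddef.h>

typedef struct { int jobid; int vpid; } toy_proc_name_t;

typedef struct {
    toy_proc_name_t proc;         /* which process the prediction is about   */
    const char     *source;       /* which detection/prediction agent said so */
    double          assurance;    /* 0.0 .. 1.0 confidence in the prediction */
    double          time_to_fail; /* optional: predicted seconds until fault */
} toy_fault_report_t;

static void recos_predicted_fault(const toy_fault_report_t *reports, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        printf("predicted fault: proc %d.%d (%.0f%% sure, from %s)\n",
               reports[i].proc.jobid, reports[i].proc.vpid,
               100.0 * reports[i].assurance, reports[i].source);
    }
}

int main(void)
{
    toy_fault_report_t r = { { 1, 4 }, "sensor-agent", 0.75, 120.0 };
    recos_predicted_fault(&r, 1);
    return 0;
}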
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
On Feb 25, 2010, at 4:38 AM, Ralph Castain wrote: > > On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote: > >> Hi Ralph, >> >> Very interesting the "composite framework" idea. > > Josh is the force behind that idea :-) It solves a pretty interesting little problem. Its utility will really shine when I move the new components into place in the coming weeks/month. > >> Regarding to the schema represented by the picture, I didn't understand the >> RecoS' behaviour in a node failure situation. >> >> In this case, will mpirun consider the daemon failure as a normal proc >> failure? If it is correct, should mpirun update the global procs state for >> all jobs running under the failed daemon? > > I haven't included the node failure case yet - still on my "to-do" list. In > brief, the answer is yes/no. :-) > > Daemon failure follows the same code path as shown in the flow chart. > However, it is up to the individual modules to determine a response to that > failure. The "orcm" RecoS module response is to (a) mark all procs on that > node as having failed, (b) mark that node as "down" so it won't get reused, > and (c) remap and restart all such procs on the remaining available nodes, > starting new daemon(s) as required. > > In the orcm environment, nodes that are replaced or rebooted automatically > start their own daemon. This is detected by orcm, and the node state (if the > node is rebooted) will automatically be updated to "up" - if it is a new > node, it is automatically added to the available resources. This allows the > node to be reused once the problem has been corrected. In other environments > (ssh, slurm, etc), the node is simply left as "down" as there is no way to > know if/when the node becomes available again. > > If you aren't using the "orcm" module, then the default behavior will abort > the job. Just to echo this response. The orted and process failures use the same error path, but can be easily differentiated by their jobids. The 'orcm' component is a good example of differentiating these two fault scenarios to correctly recover the ORTE job. Soon we may/should/will have the same ability with certain MPI jobs. :) -- Josh > > >> >> Best regards, >> Leonardo >> >> On Feb 25, 2010, at 7:05 AM, Ralph Castain wrote: >> >>> Hi George et al >>> >>> I have begun documenting the RecoS operation on the OMPI wiki: >>> >>> https://svn.open-mpi.org/trac/ompi/wiki/RecoS >>> >>> I'll continue to work on this over the next few days by adding a section >>> explaining what was changed outside of the new framework to make it all >>> work. In addition, I am revising the recos.h API documentation. >>> >>> Hope to have all that done over the weekend. >>> >>> >>> On Feb 23, 2010, at 4:00 PM, Ralph Castain wrote: >>> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote: > Ralph, Josh, > > We have some comments about the API of the new framework, mostly > clarifications needed to better understand how this new framework is > supposed to be used. And a request for a deadline extension, to delay the > code merge from the Recos branch in the trunk by a week. > > We have our own FT branch, with a totally different approach than what is > described in your RFC. Unfortunately, it diverged from the trunk about a > year ago, and merging back had proven to be a quite difficult task. Some > of the functionality in the Recos framework is clearly beneficial for > what we did, and has the potential to facilitate the porting of most of > the features from our brach back in trunk. 
We would like the deadline > extension in order to deeply analyze the impact of the Recos framework on > our work, and see how we can fit everything together back in the trunk of > Open MPI. No problem with the extension - feel free to suggest modifications to make the merge easier. This is by no means cast in stone, but rather a starting point. > > Here are some comments about the code: > > 1. The documentation in recos.h is not very clear. Most of the functions > use only IN arguments, and are not supposed to return any values. We > don't see how the functions are supposed to be used, and what is supposed > to be their impact on the ORTE framework data. I'll try to clarify the comments tonight (I know Josh is occupied right now). The recos APIs are called from two locations: 1. The errmgr calls recos whenever it receives a report of an aborted process (via the errmgr.proc_aborted API). The idea was for recos to determine what (if anything) to do about the failed process. 2. The rmaps modules can call the recos "suggest_map_targets" API to get a list of suggested nodes for the process that is to be restarted. At the moment, only the resilient mapper module does this. However, Josh and I >>>
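A small sketch of the jobid test alluded to above, using hypothetical types: daemon and application-process failures arrive on the same error path and are told apart by comparing the failed process's jobid with the daemon job's jobid (the concrete jobid values below are made up).

/* Editor's sketch: telling a daemon failure from an app-process failure by
 * jobid.  Hypothetical toy types; in ORTE the daemons belong to a different
 * job than the launched MPI application processes. */
#include <stdio.h>

typedef struct { int jobid; int vpid; } toy_proc_name_t;

static void handle_fault(toy_proc_name_t failed, int daemon_jobid)
{
    if (failed.jobid == daemon_jobid) {
        /* a daemon (and hence its node) is gone: fail its procs, mark the
         * node down, remap/restart elsewhere (orcm-style response) */
        printf("daemon %d.%d failed -> node-level recovery\n",
               failed.jobid, failed.vpid);
    } else {
        /* an application process died: per-process recovery */
        printf("app proc %d.%d failed -> process-level recovery\n",
               failed.jobid, failed.vpid);
    }
}

int main(void)
{
    handle_fault((toy_proc_name_t){ 0, 2 }, 0);  /* daemon jobid assumed 0 here */
    handle_fault((toy_proc_name_t){ 1, 7 }, 0);  /* app proc in job 1 */
    return 0;
}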
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
Hi Ralph and Josh, >>> Regarding to the schema represented by the picture, I didn't understand the >>> RecoS' behaviour in a node failure situation. >>> >>> In this case, will mpirun consider the daemon failure as a normal proc >>> failure? If it is correct, should mpirun update the global procs state for >>> all jobs running under the failed daemon? >> >> I haven't included the node failure case yet - still on my "to-do" list. In >> brief, the answer is yes/no. :-) >> >> Daemon failure follows the same code path as shown in the flow chart. >> However, it is up to the individual modules to determine a response to that >> failure. The "orcm" RecoS module response is to (a) mark all procs on that >> node as having failed, (b) mark that node as "down" so it won't get reused, >> and (c) remap and restart all such procs on the remaining available nodes, >> starting new daemon(s) as required. >> >> In the orcm environment, nodes that are replaced or rebooted automatically >> start their own daemon. This is detected by orcm, and the node state (if the >> node is rebooted) will automatically be updated to "up" - if it is a new >> node, it is automatically added to the available resources. This allows the >> node to be reused once the problem has been corrected. In other environments >> (ssh, slurm, etc), the node is simply left as "down" as there is no way to >> know if/when the node becomes available again. >> >> If you aren't using the "orcm" module, then the default behavior will abort >> the job. > > Just to echo this response. The orted and process failures use the same error > path, but can be easily differentiated by their jobids. The 'orcm' component > is a good example of differentiating these two fault scenarios to correctly > recover the ORTE job. Soon we may/should/will have the same ability with > certain MPI jobs. :) Hum... I'm really afraid about this. I understand your choice since it is really a good solution for fail/stop/restart behaviour, but looking from the fail/recovery side, can you envision some alternative for the orted's reconfiguration on the fly? Best regards, Leonardo
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote: > Hum... I'm really afraid about this. I understand your choice since it is > really a good solution for fail/stop/restart behaviour, but looking from the > fail/recovery side, can you envision some alternative for the orted's > reconfiguration on the fly? Leonardo, I don't see why the current code would prohibit such behavior. However, I don't see right now in this branch how the remaining daemons (and MPI processes) reconstruct the communication topology, but this is just a technicality. Anyway, this is the code that UT will bring in. All our work focuses on maintaining the existing environment up and running instead of restarting everything. The orted will auto-heal (i.e. reshape the underlying topology, recreate the connections, and so on), and the fault is propagated to the MPI layer, which will decide what to do next. george.
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
Hi George, >> Hum... I'm really afraid about this. I understand your choice since it is >> really a good solution for fail/stop/restart behaviour, but looking from the >> fail/recovery side, can you envision some alternative for the orted's >> reconfiguration on the fly? > > I don't see why the current code prohibit such behavior. However, I don't see > right now in this branch how the remaining daemons (and MPI processes) > reconstruct the communication topology, but this is just a technicality. > > Anyway, this is the code that UT will bring in. All our work focus on > maintaining the exiting environment up and running instead of restarting > everything. The orted will auto-heal (i.e reshape the underlying topology, > recreate the connections, and so on), and the fault is propagated to the MPI > layer who will take the decision on what to do next. When you say MPI layer, what exactly does that mean? The MPI interface, or the network stack which supports MPI communication (BTL, PML, etc.)? In my mind I see an orted failure (and that of all procs running under this daemon) as an environment failure which leads to job failures. Thus, to use a fail/recovery strategy, the daemon should be recovered (possibly by relaunching it and updating its procs/jobs structures), and after that all failed procs originally running under this daemon should be recovered as well (maybe from a checkpoint, optionally with a log). Of course, if available, a spare orted could be used. Regarding the MPI application, this 'environment reconfiguration' probably requires updates/reconfiguration/whatever on the communication stack which supports MPI communication (BTL, PML, etc.). Are we thinking in the same direction, or have I missed something along the way? Best regards, Leonardo
Re: [OMPI devel] question about pids
Ralph, We'd like this to be able to support attaching a debugger to the application. Would it be difficult to provide? We don't need the information all at once, each PID could be sent as the process launches (as long as the XML is correctly formatted) if that makes it any easier. Greg On Feb 23, 2010, at 3:58 PM, Ralph Castain wrote: > I don't see a way to currently do that - the rmaps display comes -before- > process launch, so the pid will not be displayed. > > Do you need to see them? We'd have to add that output somewhere post-launch - > perhaps when debuggers are initialized. > > On Feb 23, 2010, at 12:58 PM, Greg Watson wrote: > >> Ralph, >> >> I notice that you've got support in the XML output code to display the pids >> of the processes, but I can't see how to enable them. Can you give me any >> pointers? >> >> Thanks, >> Greg >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
On Feb 25, 2010, at 8:32 AM, George Bosilca wrote: > > On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote: > >> Hum... I'm really afraid about this. I understand your choice since it is >> really a good solution for fail/stop/restart behaviour, but looking from the >> fail/recovery side, can you envision some alternative for the orted's >> reconfiguration on the fly? > > Leonardo, > > I don't see why the current code prohibit such behavior. However, I don't see > right now in this branch how the remaining daemons (and MPI processes) > reconstruct the communication topology, but this is just a technicality. If you use the 'cm' routed component, then the reconstruction of the ORTE-level communication works for all but the loss of the HNP. Neither Ralph nor I have looked at supporting other routed components at this time. I know your group at UTK has done some work in this area, so we wanted to tackle additional support for more scalable routed components as a second step, hopefully in collaboration with your group. As far as the MPI layer goes, I can't say much at this point on how that works. This RFC only handles recovery of the ORTE layer; MPI-layer recovery is a second step and involves much longer discussions. I have a solution for a certain type of MPI application, and it sounds like you have something that can be applied more generally. > > Anyway, this is the code that UT will bring in. All our work focus on > maintaining the exiting environment up and running instead of restarting > everything. The orted will auto-heal (i.e reshape the underlying topology, > recreate the connections, and so on), and the fault is propagated to the MPI > layer who will take the decision on what to do next. Per my previous suggestion, would it be useful to chat on the phone early next week about our various strategies? -- Josh > > george. > > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] question about pids
Have you looked at orte-ps? It contains all the information you'll need to attach a debugger to an already running application. Ashley, On 25 Feb 2010, at 17:43, Greg Watson wrote: > Ralph, > > We'd like this to be able to support attaching a debugger to the application. > Would it be difficult to provide? We don't need the information all at once, > each PID could be sent as the process launches (as long as the XML is > correctly formatted) if that makes it any easier. > > Greg > > On Feb 23, 2010, at 3:58 PM, Ralph Castain wrote: > >> I don't see a way to currently do that - the rmaps display comes -before- >> process launch, so the pid will not be displayed. >> >> Do you need to see them? We'd have to add that output somewhere post-launch >> - perhaps when debuggers are initialized. >> >> On Feb 23, 2010, at 12:58 PM, Greg Watson wrote: >> >>> Ralph, >>> >>> I notice that you've got support in the XML output code to display the pids >>> of the processes, but I can't see how to enable them. Can you give me any >>> pointers? -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
[OMPI devel] RFC: increase default AC/AM/LT requirements
WHAT: Bump minimum required versions of GNU autotools up to modern versions. I suggest the following, but could be talked down a version or two: Autoconf: 2.65 Automake: 1.11.1 Libtool: 2.2.6b WHY: Stop carrying patches and workarounds for old versions. WHERE: autogen.sh, make_dist_tarball, various Makefile.am's, configure.ac, *.m4. WHEN: No real rush. Somewhere in 1.5.x. TIMEOUT: Friday March 5, 2010 I was debugging a complex Automake timestamp issue yesterday and discovered that it was caused by the fact that we are patching an old version of libtool.m4. It took a little while to figure out both the problem and an acceptable workaround. During this process, I noticed that autogen.sh still carries patches to fix bugs in some *really* old versions of Libtool (e.g., 1.5.22). Hence, I am sending this RFC to increase the minimum required versions. Keep in mind: 1. This ONLY affects developers. Those who build from tarballs don't even need to have the Autotools installed. 2. Autotool patches should always be pushed upstream. We should only maintain patches for things that have been pushed upstream but have not yet been released. 3. We already have much more recent Autotools requirements for official distribution tarballs; see the chart here: http://www.open-mpi.org/svn/building.php Specifically: although official tarballs require recent Autotools, we allow developers to use much older versions. Why are we still carrying around this old kruft? Does some developer out there have a requirement to use older Autotools? If not, this RFC proposes to only allow recent versions of the Autotools to build Open MPI. I believe there's reasonable m4 these days that can make autogen/configure/whatever abort early if the versions are not new enough. This would allow us, at a minimum, to drop some of the libtool patches we're carrying. There may be some Makefile.am workarounds that are no longer necessary, too. There's no real rush on this; if this RFC passes, we can set a concrete, fixed date at some point in the future when we switch over to requiring new versions. This should give everyone plenty of time to update if you need to, etc. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: increase default AC/AM/LT requirements
I think our last set of minimums was based on being able to use RHEL4 out of the box. Updating to whatever ships with RHEL5 probably makes sense, but I think that still leaves you at a LT 1.5.x release. Being higher than that requires new Autotools, which seems like asking for trouble. Brian On Feb 25, 2010, at 4:47 PM, Jeff Squyres wrote: > WHAT: Bump minimum required versions of GNU autotools up to modern versions. > I suggest the following, but could be talked down a version or two: > Autoconf: 2.65 > Automake: 1.11.1 > Libtool: 2.2.6b > > WHY: Stop carrying patches and workarounds for old versions. > > WHERE: autogen.sh, make_dist_tarball, various Makefile.am's, configure.ac, > *.m4. > > WHEN: No real rush. Somewhere in 1.5.x. > > TIMEOUT: Friday March 5, 2010 > > > > I was debugging a complex Automake timestamp issue yesterday and discovered > that it was caused by the fact that we are patching an old version of > libtool.m4. It took a little while to figure out both the problem and an > acceptable workaround. During this process, I noticed that autogen.sh still > carries patches to fix bugs in some *really* old versions of Libtool (e.g., > 1.5.22). Hence, I am send this RFC to increase the minimum required versions. > > Keep in mind: > > 1. This ONLY affects developers. Those who build from tarballs don't even > need to have the Autotools installed. > 2. Autotool patches should always be pushed upstream. We should only > maintain patches for things that have been pushed upstream but have not yet > been released. > 3. We already have much more recent Autotools requirements for official > distribution tarballs; see the chart here: > >http://www.open-mpi.org/svn/building.php > > Specifically: although official tarballs require recent Autotools, we allow > developers to use much older versions. Why are we still carrying around > this old kruft? Does some developer out there have a requirement to use > older Autotools? > > If not, this RFC proposes to only allow recent versions of the Autotools to > build Open MPI. I believe there's reasonable m4 these days that can make > autogen/configure/whatever abort early if the versions are not new enough. > This would allow us, at a minimum, to drop some of the libtool patches we're > carrying. There may be some Makefile.am workarounds that are no longer > necessary, too. > > There's no real rush on this; if this RFC passes, we can set a concrete, > fixed date some point in the future where we switch over to requiring new > versions. This should give everyone plenty of time to update if you need to, > etc. > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel > -- Brian W. Barrett Dept. 1423: Scalable System Software Sandia National Laboratories
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
I believe you are thinking parallel to what Josh and I have been doing, and slightly different to the UTK approach. The "orcm" method follows what you describe: we maintain operation on the current remaining nodes, see if we can use another new node to replace the failed one, and redistribute the affected procs (on the failed node) either to existing nodes or to new ones. I believe UTK's approach focuses on retaining operation of the existing nodes, redistributing procs across them. I suspect we will eventually integrate some of these operations so that users can exploit the best of both methods. Josh hasn't exposed his MPI recovery work yet. As he mentioned in his response, he has done some things in this area that are complementary to the UTK method. Just needs to finish his thesis before making them public. :-) On Thu, Feb 25, 2010 at 9:54 AM, Leonardo Fialho wrote: > Hi George, > > >> Hum... I'm really afraid about this. I understand your choice since it > is really a good solution for fail/stop/restart behaviour, but looking from > the fail/recovery side, can you envision some alternative for the orted's > reconfiguration on the fly? > > > > I don't see why the current code prohibit such behavior. However, I don't > see right now in this branch how the remaining daemons (and MPI processes) > reconstruct the communication topology, but this is just a technicality. > > > > Anyway, this is the code that UT will bring in. All our work focus on > maintaining the exiting environment up and running instead of restarting > everything. The orted will auto-heal (i.e reshape the underlying topology, > recreate the connections, and so on), and the fault is propagated to the MPI > layer who will take the decision on what to do next. > > > When you say MPI layer, what exactly it means? The MPI interface or the > network stack which supports the MPI communication (BTL, PML, etc.)? > > In my mind I see an orted failure (and all procs running under this deamon) > as an environment failure which leads to job failures. Thus, to use a > fail/recovery strategy, this daemons should be recovered (possibly > relaunching and updating its procs/jobs structures) and after that all > failed procs which are originally running under this daemon should be > recovered also (maybe from a checkpoint, log optionally). Of course, in > available, an spare orted could be used. > > Regarding to the MPI application, probably this 'environment > reconfiguration' requires updates/reconfiguration/whatever on the > communication stack which supports the MPI communication (BTL, PML, etc.). > > Are we thinking in the same direction or I have missed something in the > way? > > Best regards, > Leonardo > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
Just to add to Josh's comment: I am working now on recovering from HNP failure as well. Should have that in a month or so. On Thu, Feb 25, 2010 at 10:46 AM, Josh Hursey wrote: > > On Feb 25, 2010, at 8:32 AM, George Bosilca wrote: > > > > > On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote: > > > >> Hum... I'm really afraid about this. I understand your choice since it > is really a good solution for fail/stop/restart behaviour, but looking from > the fail/recovery side, can you envision some alternative for the orted's > reconfiguration on the fly? > > > > Leonardo, > > > > I don't see why the current code prohibit such behavior. However, I don't > see right now in this branch how the remaining daemons (and MPI processes) > reconstruct the communication topology, but this is just a technicality. > > If you use the 'cm' routed component then the reconstruction of the ORTE > level communication works for all but the loss of the HNP. Neither Ralph or > I have looked at supporting other routed components at this time. I know > your group at UTK has some done work in this area so we wanted to tackle > additional support for more scalable routed components as a second step, > hopefully with collaboration from your group. > > As far as the MPI layer, I can't say much at this point on how that works. > This RFC only handles recovery of the ORTE layer, MPI layer recovery is a > second step and involves much longer discussions. I have a solution for a > certain type of MPI application, and it sounds like you have something that > can be applied more generally. > > > > > Anyway, this is the code that UT will bring in. All our work focus on > maintaining the exiting environment up and running instead of restarting > everything. The orted will auto-heal (i.e reshape the underlying topology, > recreate the connections, and so on), and the fault is propagated to the MPI > layer who will take the decision on what to do next. > > Per my previous suggestion, would it be useful to chat on the phone early > next week about our various strategies? > > -- Josh > > > > > > george. > > > > > > > > ___ > > devel mailing list > > de...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >
Re: [OMPI devel] question about pids
Easy to do. I'll dump all the pids at the same time when the launch completes - effectively, it will be at the same point used by other debuggers to attach. Have it for you in the trunk this weekend. Can you suggest an xml format you would like? Otherwise, I'll just use the current proc output (used in the map output) and add a "pid" field to it. On Thu, Feb 25, 2010 at 10:43 AM, Greg Watson wrote: > Ralph, > > We'd like this to be able to support attaching a debugger to the > application. Would it be difficult to provide? We don't need the information > all at once, each PID could be sent as the process launches (as long as the > XML is correctly formatted) if that makes it any easier. > > Greg > > On Feb 23, 2010, at 3:58 PM, Ralph Castain wrote: > > > I don't see a way to currently do that - the rmaps display comes -before- > process launch, so the pid will not be displayed. > > > > Do you need to see them? We'd have to add that output somewhere > post-launch - perhaps when debuggers are initialized. > > > > On Feb 23, 2010, at 12:58 PM, Greg Watson wrote: > > > >> Ralph, > >> > >> I notice that you've got support in the XML output code to display the > pids of the processes, but I can't see how to enable them. Can you give me > any pointers? > >> > >> Thanks, > >> Greg > >> ___ > >> devel mailing list > >> de...@open-mpi.org > >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > > > > ___ > > devel mailing list > > de...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >
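A sketch of what a post-launch pid dump might look like, using hypothetical toy types rather than the real orte_job_t/orte_proc_t structures or the --xml formatter; the point is simply that the loop runs once all procs have reported as launched, i.e. the same point where debugger attach is allowed.

/* Editor's sketch only -- hypothetical types, not the orterun code. */
#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>

typedef struct {
    int         rank;
    pid_t       pid;
    const char *node;
} toy_proc_t;

/* called once every proc has reported launched */
static void dump_pids(const toy_proc_t *procs, size_t nprocs)
{
    for (size_t i = 0; i < nprocs; i++) {
        printf("rank %d on %s: pid %ld\n",
               procs[i].rank, procs[i].node, (long)procs[i].pid);
    }
}

int main(void)
{
    toy_proc_t procs[2] = { { 0, 1234, "node0" }, { 1, 1235, "node1" } };
    dump_pids(procs, 2);
    return 0;
}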
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
Josh, Next week is a little bit too early, as we will need some time to figure out how to integrate with this new framework and to what extent our code and requirements fit into it. Then the week after is the MPI Forum. How about Thursday 11 March? Thanks, george. On Feb 25, 2010, at 12:46 , Josh Hursey wrote: > Per my previous suggestion, would it be useful to chat on the phone early > next week about our various strategies?
Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk
If Josh is going to be at the forum, perhaps you folks could chat there? Might as well take advantage of being colocated, if possible. Otherwise, I'm available pretty much any time. I can't contribute much about the MPI recovery issues, but can contribute to the RTE issues if that helps. On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca wrote: > Josh, > > Next week is a little bit too early as will need some time to figure out > how to integrate with this new framework, and at what extent our code and > requirements fit into. Then the week after is the MPI Forum. How about on > Thursday 11 March? > > Thanks, > george. > > On Feb 25, 2010, at 12:46 , Josh Hursey wrote: > > > Per my previous suggestion, would it be useful to chat on the phone early > next week about our various strategies? > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel >