Just a reminder that this RFC will go into the trunk this evening unless there are strong objections.
We intend to let this soak for a few days, then bring it over to the 1.5 series (after the 1.5.0 release).

-- Josh

On Mar 15, 2010, at 9:26 AM, Josh Hursey wrote:

> (Updated RFC, per offline discussion)
>
> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
>
> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery
> capabilities
>
> WHERE: Impacts a number of ORTE files and the ORTE ErrMgr framework
>
> TIMEOUT: Barring objections and/or further requests for delay, evening of
> March 23
>
> REFERENCE BRANCH: http://bitbucket.org/jjhursey/orte-errmgr/
>
> ======================================================================
>
> BACKGROUND:
>
> Josh and Ralph have been working on a private branch off of the trunk on
> extended fault recovery procedures, mostly impacting ORTE. The new code
> optionally allows ORTE to recover from failed nodes, moving processes to
> other nodes in order to maintain operation. In addition, the code provides
> better support for recovering from individual process failures.
>
> Not all of the work done on the private branch will be brought over in this
> commit. Some of the MPI-specific code that allows recovery from process
> failure on-the-fly will be committed separately at a later date. This commit
> provides the foundation for ORTE stabilization that can be built upon to
> provide OMPI layer stability in the future.
>
> This commit significantly modifies the ORTE ErrMgr framework to support those
> advanced recovery operations. The ErrMgr public interface has been preserved
> since it is used in various places throughout the codebase, and should
> continue to be used as normal. The ErrMgr framework has been internally
> redesigned to better support multiple strategies for responding to failures
> (this represents a merge of the old ErrMgr and the RecoS framework into the
> ErrMgr 3.0 component interface).
> The default (base) mode will continue to
> work exactly the same as today, aborting the job when a failure occurs.
> However, if the user elects to enable recovery, then one or more ErrMgr
> components will be activated to determine the recovery policy for the job.
>
> We have created a public repo (reference branch, above) with the code to be
> merged into the trunk (r22815). Please feel free to check it out and test it.
>
> NOTE: The new recovery capability is only active if the user elects to use it
> by setting the MCA parameter errmgr_base_enable_recovery to '1'.
>
> NOTE: More ErrMgr recovery components will be coming online in the near
> future; currently this branch only includes the 'orcm' module for ORTE
> process recovery (not MPI processes). If you want to experiment with this
> feature, below are the MCA parameters that you will need to get started.
>> #################################
>> plm=rsh
>> rmaps=resilient
>> routed=cm
>> errmgr_base_enable_recovery=1
>> #################################
>
> Comments, suggestions, and corrections are welcome!
>
>
>
> On Mar 10, 2010, at 2:22 PM, Josh Hursey wrote:
>
>> Wesley,
>>
>> Thanks for catching that oversight. Below are the MCA parameters that you
>> should need at the moment:
>> #####################################
>> # Use the C/R Process Migration Recovery Supervisor
>> recos_base_enable=1
>> # Only use the 'rsh' launcher, other launchers will be supported later
>> plm=rsh
>> # The resilient mapper knows how to use RecoS and deal with recovering procs
>> rmaps=resilient
>> # 'cm' component is the only one that can handle failures at the moment
>> routed=cm
>> #####################################
>>
>> Let me know if you have any trouble.
>>
>> -- Josh
>>
>> On Mar 10, 2010, at 10:36 AM, Wesley Bland wrote:
>>
>>> Josh,
>>>
>>> You mentioned some MCA parameters that you would include in the email, but
>>> I don't see those parameters anywhere.
>>> Could you please put those in here
>>> to make testing easier for people?
>>>
>>> Wesley
>>>
>>> On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>>> Yesterday evening George, Thomas and I discussed some of their concerns
>>> about this RFC at the MPI Forum meeting. After the discussion, we seemed to
>>> be in agreement that the RecoS framework is a good idea and the concepts
>>> and fixes in this RFC should move forward with a couple of notes:
>>>
>>> - They wanted to test the branch a bit more over the next couple of days.
>>> Some MCA parameters that you will need are at the bottom of this message.
>>>
>>> - Reiterate that this RFC only addresses ORTE stability, not OMPI
>>> stability. The OMPI stability extension is a second step for the line of
>>> work, and should/will fit in nicely with the RecoS framework being proposed
>>> in this RFC. The OMPI layer stability will require a significant amount of
>>> work, but the RecoS framework will provide the ORTE layer stability that is
>>> required as a foundation for OMPI layer stability in the future.
>>>
>>> - The purpose of the ErrMgr becomes slightly unclear with the addition of
>>> the RecoS framework, since both are focused on responding to faults in the
>>> system (and RecoS, when enabled, overrides most/all of the ErrMgr
>>> functionality). Should the RecoS framework be merged with the ErrMgr
>>> framework to create a new ErrMgr interface?
>>>
>>> We are trying to decide if we should merge these frameworks, but at this
>>> point we are interested in hearing how other developers feel about merging
>>> the ErrMgr and RecoS frameworks, which would change the ErrMgr API. Are
>>> there any developers out there that are developing ErrMgr components, or
>>> are using any particular features of the existing ErrMgr framework that
>>> they would like to see preserved in the next revision?
>>> By default, the
>>> existing abort behavior of the ErrMgr framework will be preserved,
>>> so the user will have to 'opt-in' to any fault recovery capabilities.
>>>
>>> So we are continuing the discussion a bit more off-list, and will return to
>>> the list with an updated RFC (and possibly a new branch) soon (hopefully
>>> end of the week/early next week). I would like to briefly discuss this RFC
>>> at the Open MPI teleconf next Tuesday.
>>>
>>> -- Josh
>>>
>>> On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote:
>>>
>>>> Sounds good to me.
>>>>
>>>> For those casually following this RFC let me summarize its current state.
>>>>
>>>> Josh and George (and anyone else attending the forum who wishes to
>>>> participate) will meet sometime at the next MPI Forum meeting (March 8-10). I
>>>> will post any relevant notes from this meeting back to the list
>>>> afterwards. So the RFC is on hold pending the outcome of that meeting. For
>>>> those developers interested in this RFC that will not be able to attend,
>>>> feel free to continue using this thread for discussion.
>>>>
>>>> Thanks,
>>>> Josh
>>>>
>>>> On Feb 26, 2010, at 6:09 AM, George Bosilca wrote:
>>>>
>>>>>
>>>>> On Feb 26, 2010, at 01:50 , Josh Hursey wrote:
>>>>>
>>>>>> Any of those options are fine with me. I was thinking that if you wanted
>>>>>> to talk sooner, we might be able to help explain our intentions with
>>>>>> this framework a bit better. I figure that the framework interface will
>>>>>> change a bit as we all advance and incorporate our various techniques
>>>>>> into it. I think that the current interface is a good first step, but
>>>>>> there are certainly many more steps to come.
>>>>>>
>>>>>> I am fine delaying this code a bit, just not too long. Meeting at the
>>>>>> forum for a while might be a good option (we could probably even arrange
>>>>>> to call in others if you wanted).
>>>>>
>>>>> Sounds good, let's do this.
>>>>>
>>>>> Thanks,
>>>>> george.
>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Josh
>>>>>>
>>>>>> On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote:
>>>>>>
>>>>>>> If Josh is going to be at the forum, perhaps you folks could chat
>>>>>>> there? Might as well take advantage of being colocated, if possible.
>>>>>>>
>>>>>>> Otherwise, I'm available pretty much any time. I can't contribute much
>>>>>>> about the MPI recovery issues, but can contribute to the RTE issues if
>>>>>>> that helps.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca <bosi...@eecs.utk.edu>
>>>>>>> wrote:
>>>>>>> Josh,
>>>>>>>
>>>>>>> Next week is a little bit too early, as we will need some time to figure
>>>>>>> out how to integrate with this new framework, and to what extent our
>>>>>>> code and requirements fit into it. Then the week after is the MPI Forum.
>>>>>>> How about on Thursday 11 March?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> george.
>>>>>>>
>>>>>>> On Feb 25, 2010, at 12:46 , Josh Hursey wrote:
>>>>>>>
>>>>>>>> Per my previous suggestion, would it be useful to chat on the phone
>>>>>>>> early next week about our various strategies?
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
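
For anyone who wants to try the branch: the parameters listed in the updated RFC do not have to be typed on every run. Below is a minimal sketch that assumes the standard per-user MCA parameter file ($HOME/.openmpi/mca-params.conf) is honored by a build of the reference branch. The parameter names and values are copied from the RFC; the process count and the ./my_app binary in the comment are placeholders, not part of the original thread.

```
# $HOME/.openmpi/mca-params.conf
# Values taken from the updated RFC (Mar 15, 2010); they apply to the
# reference branch and may not exist in other Open MPI versions.

# Only the 'rsh' launcher is supported for recovery at the moment
plm = rsh
# The resilient mapper knows how to deal with recovering procs
rmaps = resilient
# 'cm' is the only routed component that can handle failures
routed = cm
# Opt in to recovery; the default remains abort-on-failure
errmgr_base_enable_recovery = 1

# Equivalent one-off invocation (./my_app and -np 4 are placeholders):
#   mpirun --mca plm rsh --mca rmaps resilient --mca routed cm \
#          --mca errmgr_base_enable_recovery 1 -np 4 ./my_app
```

A setting passed on the mpirun command line takes precedence over the file, so a single run can turn recovery back off with --mca errmgr_base_enable_recovery 0.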