Just a reminder that this RFC will go into the trunk this evening unless there 
are strong objections.

We intend to let this soak for a few days, then bring it over to the 1.5 series 
(after the 1.5.0 release).

-- Josh

On Mar 15, 2010, at 9:26 AM, Josh Hursey wrote:

> (Updated RFC, per offline discussion)
> 
> WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
> 
> WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery 
> capabilities
> 
> WHERE: Impacts a number of ORTE files and the ORTE ErrMgr framework
> 
> TIMEOUT: Barring objections and/or further requests for delay, evening of 
> March 23
> 
> REFERENCE BRANCH: http://bitbucket.org/jjhursey/orte-errmgr/
> 
> ======================================================================
> 
> BACKGROUND:
> 
> Josh and Ralph have been working on a private branch off of the trunk on 
> extended fault recovery procedures, mostly impacting ORTE. The new code 
> optionally allows ORTE to recover from failed nodes, moving processes to 
> other nodes in order to maintain operation. In addition, the code provides 
> better support for recovering from individual process failures.
> 
> Not all of the work done on the private branch will be brought over in this 
> commit. Some of the MPI-specific code that allows recovery from process 
> failure on-the-fly will be committed separately at a later date. This commit 
> provides the foundation for ORTE stabilization that can be built upon to 
> provide OMPI layer stability in the future.
> 
> This commit significantly modifies the ORTE ErrMgr framework to support those 
> advanced recovery operations. The ErrMgr public interface has been preserved 
> since it is used in various places throughout the codebase, and should 
> continue to be used as normal. The ErrMgr framework has been internally 
> redesigned to better support multiple strategies for responding to failures 
> (this represents a merge of the old ErrMgr and RecoS frameworks into the 
> ErrMgr 3.0 component interface). The default (base) mode will continue to 
> work exactly the same as today, aborting the job when a failure occurs. 
> However, if the user elects to enable recovery then one or more ErrMgr 
> components will be activated to determine the recovery policy for the job.
> 
> We have created a public repo (reference branch, above) with the code to be 
> merged into the trunk (r22815). Please feel free to check it out and test it.
> 
> NOTE: The new recovery capability is only active if the user elects to use it 
> by setting the MCA parameter errmgr_base_enable_recovery to '1'.
> 
> NOTE: More ErrMgr recovery components will be coming online in the near 
> future; currently this branch only includes the 'orcm' module for ORTE 
> process recovery (not MPI processes). If you want to experiment with this 
> feature, below are the MCA parameters that you will need to get started.
>> #################################
>> plm=rsh
>> rmaps=resilient
>> routed=cm
>> errmgr_base_enable_recovery=1
>> #################################
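
The four settings above can also be passed on the mpirun command line (`-mca <name> <value>`) or exported as `OMPI_MCA_<name>` environment variables instead of being placed in an MCA parameter file. The Python sketch below only generates both forms from the list above and is illustrative, not part of the branch. The helper names (`mca_args`, `mca_env`), the process count, and `./my_app` are assumptions made up for the example, and `errmgr_base_enable_recovery` is only expected to exist on the branch described in this RFC.

```python
import shlex

# Parameters copied from the RFC text above.
RECOVERY_PARAMS = {
    "plm": "rsh",
    "rmaps": "resilient",
    "routed": "cm",
    "errmgr_base_enable_recovery": "1",
}


def mca_args(params):
    """Return mpirun arguments of the form: -mca <name> <value> ..."""
    args = []
    for name, value in params.items():
        args += ["-mca", name, value]
    return args


def mca_env(params):
    """Return the equivalent OMPI_MCA_<name> environment variables."""
    return {f"OMPI_MCA_{name}": value for name, value in params.items()}


if __name__ == "__main__":
    # "-np 4" and "./my_app" are placeholders, not taken from the RFC.
    cmd = ["mpirun", *mca_args(RECOVERY_PARAMS), "-np", "4", "./my_app"]
    print(shlex.join(cmd))
    for name, value in mca_env(RECOVERY_PARAMS).items():
        print(f"export {name}={value}")
```

Leaving `errmgr_base_enable_recovery` unset keeps the default abort-on-failure behavior described in the RFC.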
> 
> Comments, suggestions, and corrections are welcome!
> 
> 
> 
> On Mar 10, 2010, at 2:22 PM, Josh Hursey wrote:
> 
>> Wesley,
>> 
>> Thanks for catching that oversight. Below are the MCA parameters that you 
>> should need at the moment:
>> #####################################
>> # Use the C/R Process Migration Recovery Supervisor
>> recos_base_enable=1
>> # Only use the 'rsh' launcher, other launchers will be supported later
>> plm=rsh
>> # The resilient mapper knows how to use RecoS and deal with recovering procs
>> rmaps=resilient
>> # 'cm' component is the only one that can handle failures at the moment
>> routed=cm
>> #####################################
>> 
>> Let me know if you have any troubles.
>> 
>> -- Josh
>> 
>> On Mar 10, 2010, at 10:36 AM, Wesley Bland wrote:
>> 
>>> Josh,
>>> 
>>> You mentioned some MCA parameters that you would include in the email, but 
>>> I don't see those parameters anywhere. Could you please put those in here 
>>> to make testing easier for people?
>>> 
>>> Wesley
>>> 
>>> On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>>> Yesterday evening George, Thomas and I discussed some of their concerns 
>>> about this RFC at the MPI Forum meeting. After the discussion, we seemed to 
>>> be in agreement that the RecoS framework is a good idea and the concepts 
>>> and fixes in this RFC should move forward with a couple of notes:
>>> 
>>> - They wanted to test the branch a bit more over the next couple of days. 
>>> Some MCA parameters that you will need are at the bottom of this message.
>>> 
>>> - Reiterate that this RFC only addresses ORTE stability, not OMPI 
>>> stability. The OMPI stability extension is a second step for the line of 
>>> work, and should/will fit in nicely with the RecoS framework being proposed 
>>> in this RFC. The OMPI layer stability will require a significant amount of 
>>> work, but the RecoS framework will provide the ORTE layer stability that is 
>>> required as a foundation for OMPI layer stability in the future.
>>> 
>>> - The purpose of the ErrMgr becomes slightly unclear with the addition of 
>>> the RecoS framework, since both are focused on responding to faults in the 
>>> system (and RecoS, when enabled, overrides most/all of the ErrMgr 
>>> functionality). Should the RecoS framework be merged with the ErrMgr 
>>> framework to create a new ErrMgr interface?
>>> 
>>> We are trying to decide if we should merge these frameworks, but at this 
>>> point we are interested in hearing how other developers feel about merging 
>>> the ErrMgr and RecoS frameworks, which would change the ErrMgr API. Are 
>>> there any developers out there that are developing ErrMgr components, or 
>>> are using any particular features of the existing ErrMgr framework that 
>>> they would like to see preserved in the next revision? By default, the 
>>> existing abort behavior of the ErrMgr framework will be preserved, 
>>> so the user will have to 'opt-in' to any fault recovery capabilities.
>>> 
>>> So we are continuing the discussion a bit more off-list, and will return to 
>>> the list with an updated RFC (and possibly a new branch) soon (hopefully 
>>> end of the week/early next week). I would like to briefly discuss this RFC 
>>> at the Open MPI teleconf next Tuesday.
>>> 
>>> -- Josh
>>> 
>>> On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote:
>>> 
>>>> Sounds good to me.
>>>> 
>>>> For those casually following this RFC let me summarize its current state.
>>>> 
>>>> Josh and George (and anyone else attending the forum who wishes to 
>>>> participate) will meet sometime at the next MPI Forum meeting (March 8-10). I 
>>>> will post any relevant notes from this meeting back to the list 
>>>> afterwards. So the RFC is on hold pending the outcome of that meeting. For 
>>>> those developers interested in this RFC that will not be able to attend, 
>>>> feel free to continue using this thread for discussion.
>>>> 
>>>> Thanks,
>>>> Josh
>>>> 
>>>> On Feb 26, 2010, at 6:09 AM, George Bosilca wrote:
>>>> 
>>>>> 
>>>>> On Feb 26, 2010, at 01:50 , Josh Hursey wrote:
>>>>> 
>>>>>> Any of those options are fine with me. I was thinking that if you wanted 
>>>>>> to talk sooner, we might be able to help explain our intentions with 
>>>>>> this framework a bit better. I figure that the framework interface will 
>>>>>> change a bit as we all advance and incorporate our various techniques 
>>>>>> into it. I think that the current interface is a good first step, but 
>>>>>> there are certainly many more steps to come.
>>>>>> 
>>>>>> I am fine delaying this code a bit, just not too long. Meeting at the 
>>>>>> forum for a while might be a good option (we could probably even arrange 
>>>>>> to call in others if you wanted).
>>>>> 
>>>>> Sounds good, let's do this.
>>>>> 
>>>>> Thanks,
>>>>> george.
>>>>> 
>>>>>> 
>>>>>> Cheers,
>>>>>> Josh
>>>>>> 
>>>>>> On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote:
>>>>>> 
>>>>>>> If Josh is going to be at the forum, perhaps you folks could chat 
>>>>>>> there? Might as well take advantage of being colocated, if possible.
>>>>>>> 
>>>>>>> Otherwise, I'm available pretty much any time. I can't contribute much 
>>>>>>> about the MPI recovery issues, but can contribute to the RTE issues if 
>>>>>>> that helps.
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca <bosi...@eecs.utk.edu> 
>>>>>>> wrote:
>>>>>>> Josh,
>>>>>>> 
>>>>>>> Next week is a little bit too early, as we will need some time to figure 
>>>>>>> out how to integrate with this new framework, and to what extent our 
>>>>>>> code and requirements fit into it. Then the week after is the MPI Forum. 
>>>>>>> How about on Thursday 11 March?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> george.
>>>>>>> 
>>>>>>> On Feb 25, 2010, at 12:46 , Josh Hursey wrote:
>>>>>>> 
>>>>>>>> Per my previous suggestion, would it be useful to chat on the phone 
>>>>>>>> early next week about our various strategies?
>>>>>>> 
>>>>>>> 
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel