Wesley,

Thanks for catching that oversight. Below are the MCA parameters that you should need at the moment:

#####################################
# Use the C/R Process Migration Recovery Supervisor
recos_base_enable=1
# Only use the 'rsh' launcher, other launchers will be supported later
plm=rsh
# The resilient mapper knows how to use RecoS and deal with recovering procs
rmaps=resilient
# 'cm' component is the only one that can handle failures at the moment
routed=cm
#####################################
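
As a minimal sketch of one way to apply these settings: Open MPI generally accepts MCA parameters either as `name=value` lines in a parameter file (such as `$HOME/.openmpi/mca-params.conf`) or as `--mca name value` pairs on the `mpirun` command line. The Python below converts the block above into the command-line form. It assumes this branch reads the parameters the same way as any other MCA parameter; the helper names, `./my_app`, and `-np 4` are hypothetical placeholders, not part of the RFC.

```python
import shlex

# The parameter block from the message above, in "name=value" form.
MCA_PARAMS = """\
# Use the C/R Process Migration Recovery Supervisor
recos_base_enable=1
# Only use the 'rsh' launcher, other launchers will be supported later
plm=rsh
# The resilient mapper knows how to use RecoS and deal with recovering procs
rmaps=resilient
# 'cm' component is the only one that can handle failures at the moment
routed=cm
"""


def parse_mca_params(text):
    """Return (name, value) pairs, skipping blank lines and '#' comments."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        pairs.append((name.strip(), value.strip()))
    return pairs


def mpirun_command(params, app_argv):
    """Build an mpirun argument list with one '--mca name value' per parameter."""
    argv = ["mpirun"]
    for name, value in params:
        argv += ["--mca", name, value]
    return argv + list(app_argv)


if __name__ == "__main__":
    # './my_app' and '-np 4' are hypothetical; substitute your own test program.
    cmd = mpirun_command(parse_mca_params(MCA_PARAMS), ["-np", "4", "./my_app"])
    print(shlex.join(cmd))
```

Keeping the parameters in a file is probably the more convenient choice for repeated test runs; the command-line form is handy for one-off experiments where you want to toggle a single setting.
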
Let me know if you have any trouble.

-- Josh

On Mar 10, 2010, at 10:36 AM, Wesley Bland wrote:

> Josh,
>
> You mentioned some MCA parameters that you would include in the email, but I
> don't see those parameters anywhere. Could you please put those in here to
> make testing easier for people?
>
> Wesley
>
> On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> Yesterday evening George, Thomas and I discussed some of their concerns about
> this RFC at the MPI Forum meeting. After the discussion, we seemed to be in
> agreement that the RecoS framework is a good idea and the concepts and fixes
> in this RFC should move forward with a couple of notes:
>
> - They wanted to test the branch a bit more over the next couple of days.
> Some MCA parameters that you will need are at the bottom of this message.
>
> - Reiterate that this RFC only addresses ORTE stability, not OMPI stability.
> The OMPI stability extension is a second step for the line of work, and
> should/will fit in nicely with the RecoS framework being proposed in this
> RFC. The OMPI layer stability will require a significant amount of work, but
> the RecoS framework will provide the ORTE layer stability that is required as
> a foundation for OMPI layer stability in the future.
>
> - The purpose of the ErrMgr becomes slightly unclear with the addition of
> the RecoS framework, since both are focused on responding to faults in the
> system (and RecoS, when enabled, overrides most/all of the ErrMgr
> functionality). Should the RecoS framework be merged with the ErrMgr
> framework to create a new ErrMgr interface?
>
> We are trying to decide if we should merge these frameworks, but at this
> point we are interested in hearing how other developers feel about merging
> the ErrMgr and RecoS frameworks, which would change the ErrMgr API.
> Are there any developers out there that are developing ErrMgr components, or
> are using any particular features of the existing ErrMgr framework that they
> would like to see preserved in the next revision? By default, the existing
> abort behavior of the ErrMgr framework will be preserved, so the user will
> have to 'opt-in' to any fault recovery capabilities.
>
> So we are continuing the discussion a bit more off-list, and will return to
> the list with an updated RFC (and possibly a new branch) soon (hopefully end
> of the week/early next week). I would like to briefly discuss this RFC at the
> Open MPI teleconf next Tuesday.
>
> -- Josh
>
> On Feb 26, 2010, at 8:06 AM, Josh Hursey wrote:
>
> > Sounds good to me.
> >
> > For those casually following this RFC let me summarize its current state.
> >
> > Josh and George (and anyone else attending the forum who wishes to
> > participate) will meet sometime at the next MPI Forum meeting (March 8-10).
> > I will post any relevant notes from this meeting back to the list afterwards.
> > So the RFC is on hold pending the outcome of that meeting. For those
> > developers interested in this RFC that will not be able to attend, feel
> > free to continue using this thread for discussion.
> >
> > Thanks,
> > Josh
> >
> > On Feb 26, 2010, at 6:09 AM, George Bosilca wrote:
> >
> >>
> >> On Feb 26, 2010, at 01:50 , Josh Hursey wrote:
> >>
> >>> Any of those options is fine with me. I was thinking that if you wanted
> >>> to talk sooner, we might be able to help explain our intentions with this
> >>> framework a bit better. I figure that the framework interface will change
> >>> a bit as we all advance and incorporate our various techniques into it. I
> >>> think that the current interface is a good first step, but there are
> >>> certainly many more steps to come.
> >>>
> >>> I am fine delaying this code a bit, just not too long.
> >>> Meeting at the forum for a while might be a good option (we could probably
> >>> even arrange to call in others if you wanted).
> >>
> >> Sounds good, let's do this.
> >>
> >> Thanks,
> >> george.
> >>
> >>>
> >>> Cheers,
> >>> Josh
> >>>
> >>> On Feb 25, 2010, at 6:45 PM, Ralph Castain wrote:
> >>>
> >>>> If Josh is going to be at the forum, perhaps you folks could chat there?
> >>>> Might as well take advantage of being colocated, if possible.
> >>>>
> >>>> Otherwise, I'm available pretty much any time. I can't contribute much
> >>>> about the MPI recovery issues, but can contribute to the RTE issues if
> >>>> that helps.
> >>>>
> >>>>
> >>>> On Thu, Feb 25, 2010 at 7:39 PM, George Bosilca <bosi...@eecs.utk.edu>
> >>>> wrote:
> >>>> Josh,
> >>>>
> >>>> Next week is a little bit too early, as we will need some time to figure
> >>>> out how to integrate with this new framework, and to what extent our code
> >>>> and requirements fit into it. Then the week after is the MPI Forum. How
> >>>> about on Thursday 11 March?
> >>>>
> >>>> Thanks,
> >>>> george.
> >>>>
> >>>> On Feb 25, 2010, at 12:46 , Josh Hursey wrote:
> >>>>
> >>>>> Per my previous suggestion, would it be useful to chat on the phone
> >>>>> early next week about our various strategies?
> >>>>
> >>>> _______________________________________________
> >>>> devel mailing list
> >>>> de...@open-mpi.org
> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel