Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-03-23 Thread Josh Hursey
This has been committed in r22872. Let me know if you see any problems with the commit. -- Josh On Mar 23, 2010, at 7:57 AM, Joshua Hursey wrote: Just a reminder that this RFC will go into the trunk this evening unless there are strong objections. We intend to let this soak for a few days

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-03-23 Thread Joshua Hursey
Just a reminder that this RFC will go into the trunk this evening unless there are strong objections. We intend to let this soak for a few days then bring it over to the 1.5 series (after the 1.5.0 release). -- Josh On Mar 15, 2010, at 9:26 AM, Josh Hursey wrote: > (Updated RFC, per offline d

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-03-15 Thread Josh Hursey
(Updated RFC, per offline discussion)
WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery capabilities
WHERE: Impacts a number of ORTE files and an ORTE ErrMgr framework
TIMEOUT: Barring ob

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-03-10 Thread Josh Hursey
Wesley,
Thanks for catching that oversight. Below are the MCA parameters that you should need at the moment:
#
# Use the C/R Process Migration Recovery Supervisor
recos_base_enable=1
# Only use the 'rsh' launcher, other launchers will be supported later
plm=rsh
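The settings in this message are ordinary key=value MCA parameters, so they can go in an MCA parameter file (for example $HOME/.openmpi/mca-params.conf) or be passed to mpirun as --mca arguments. The sketch below is only an illustration and is not taken from the thread: the helper name mca_args, the process count, and the ./my_app executable are all assumptions. It converts the fragment into the equivalent command-line arguments.

```python
import shlex

# The fragment quoted in the message, one setting per line.
MCA_FRAGMENT = """\
#
# Use the C/R Process Migration Recovery Supervisor
recos_base_enable=1
# Only use the 'rsh' launcher, other launchers will be supported later
plm=rsh
"""


def mca_args(fragment):
    """Turn key=value MCA settings into a flat list of --mca arguments.

    Hypothetical helper: blank lines and '#' comment lines are skipped.
    """
    args = []
    for line in fragment.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value setting: {line!r}")
        args += ["--mca", key.strip(), value.strip()]
    return args


if __name__ == "__main__":
    # "-np 4" and "./my_app" are placeholders, not values from the thread.
    command = ["mpirun", "-np", "4"] + mca_args(MCA_FRAGMENT) + ["./my_app"]
    print(shlex.join(command))
```

Building the argument list rather than a single string keeps each parameter name and value as a separate token, which avoids quoting problems if a value ever contains spaces.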

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-03-10 Thread Wesley Bland
Josh, You mentioned some MCA parameters that you would include in the email, but I don't see those parameters anywhere. Could you please put those in here to make testing easier for people. Wesley On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey wrote: > Yesterday evening George, Thomas and I dis

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-03-10 Thread Josh Hursey
Yesterday evening George, Thomas and I discussed some of their concerns about this RFC at the MPI Forum meeting. After the discussion, we seemed to be in agreement that the RecoS framework is a good idea and the concepts and fixes in this RFC should move forward with a couple of notes: - They

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-26 Thread Josh Hursey
Sounds good to me. For those casually following this RFC let me summarize its current state. Josh and George (and anyone else that wishes to participate attending the forum) will meet sometime at the next MPI Forum meeting (March 8-10). I will post any relevant notes from this meeting back to t

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-26 Thread George Bosilca
On Feb 26, 2010, at 01:50 , Josh Hursey wrote:
> Any of those options are fine with me. I was thinking that if you wanted to
> talk sooner, we might be able to help explain our intentions with this
> framework a bit better. I figure that the framework interface will change a
> bit as we all ad

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-26 Thread Josh Hursey
Any of those options are fine with me. I was thinking that if you wanted to talk sooner, we might be able to help explain our intentions with this framework a bit better. I figure that the framework interface will change a bit as we all advance and incorporate our various techniques into it. I t

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
If Josh is going to be at the forum, perhaps you folks could chat there? Might as well take advantage of being colocated, if possible. Otherwise, I'm available pretty much any time. I can't contribute much about the MPI recovery issues, but can contribute to the RTE issues if that helps. On Thu,

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread George Bosilca
Josh, Next week is a little bit too early, as we will need some time to figure out how to integrate with this new framework, and to what extent our code and requirements fit into it. Then the week after is the MPI Forum. How about on Thursday 11 March? Thanks, george. On Feb 25, 2010, at 12:46

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
Just to add to Josh's comment: I am working now on recovering from HNP failure as well. Should have that in a month or so.
On Thu, Feb 25, 2010 at 10:46 AM, Josh Hursey wrote:
>
> On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> >
> > On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
> >

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
I believe you are thinking parallel to what Josh and I have been doing, and slightly different to the UTK approach. The "orcm" method follows what you describe: we maintain operation on the current remaining nodes, see if we can use another new node to replace the failed one, and redistribute the a

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Josh Hursey
On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
>
>> Hum... I'm really afraid about this. I understand your choice since it is
>> really a good solution for fail/stop/restart behaviour, but looking from the
>> fail/recovery side, can y

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi George,
>> Hum... I'm really afraid about this. I understand your choice since it is
>> really a good solution for fail/stop/restart behaviour, but looking from the
>> fail/recovery side, can you envision some alternative for the orted's
>> reconfiguration on the fly?
>
> I don't see why th

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread George Bosilca
On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
> Hum... I'm really afraid about this. I understand your choice since it is
> really a good solution for fail/stop/restart behaviour, but looking from the
> fail/recovery side, can you envision some alternative for the orted's
> reconfiguratio

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi Ralph and Josh,
>>> Regarding to the schema represented by the picture, I didn't understand the
>>> RecoS' behaviour in a node failure situation.
>>>
>>> In this case, will mpirun consider the daemon failure as a normal proc
>>> failure? If it is correct, should mpirun update the global proc

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Josh Hursey
On Feb 25, 2010, at 4:38 AM, Ralph Castain wrote:
>
> On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
>
>> Hi Ralph,
>>
>> Very interesting the "composite framework" idea.
>
> Josh is the force behind that idea :-) It solves a pretty interesting little problem. Its utility will really sh

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Josh Hursey
On Feb 23, 2010, at 3:00 PM, Ralph Castain wrote:
>
> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>
>> Ralph, Josh,
>>
>> We have some comments about the API of the new framework, mostly
>> clarifications needed to better understand how this new framework is
>> supposed to be used. An

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
> Hi Ralph,
>
> Very interesting the "composite framework" idea.
Josh is the force behind that idea :-)
> Regarding to the schema represented by the picture, I didn't understand the
> RecoS' behaviour in a node failure situation.
>
> In thi

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi Ralph, Very interesting the "composite framework" idea. Regarding to the schema represented by the picture, I didn't understand the RecoS' behaviour in a node failure situation. In this case, will mpirun consider the daemon failure as a normal proc failure? If it is correct, should mpirun u

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Leonardo Fialho
Hi Ralph, Very interesting the "composite framework" idea. Regarding to the schema represented by the picture, I didn't understand the RecoS' behaviour in a node failure situation. In this case, will mpirun consider the daemon failure as a normal proc failure? If it is correct, should mpirun u

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-25 Thread Ralph Castain
Hi George et al I have begun documenting the RecoS operation on the OMPI wiki: https://svn.open-mpi.org/trac/ompi/wiki/RecoS I'll continue to work on this over the next few days by adding a section explaining what was changed outside of the new framework to make it all work. In addition, I am

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-23 Thread Ralph Castain
On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
> Ralph, Josh,
>
> We have some comments about the API of the new framework, mostly
> clarifications needed to better understand how this new framework is supposed
> to be used. And a request for a deadline extension, to delay the code merge

Re: [OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-23 Thread George Bosilca
Ralph, Josh, We have some comments about the API of the new framework, mostly clarifications needed to better understand how this new framework is supposed to be used. And a request for a deadline extension, to delay the code merge from the Recos branch in the trunk by a week. We have our own

[OMPI devel] RFC: Merge tmp fault recovery branch into trunk

2010-02-19 Thread Ralph Castain
WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery capabilities
WHERE: Impacts a number of ORTE files and a small number of OMPI files
TIMEOUT: Barring objections and/or requests for delay, the