This has been committed in r22872.
Let me know if you see any problems with the commit.
-- Josh
On Mar 23, 2010, at 7:57 AM, Joshua Hursey wrote:
Just a reminder that this RFC will go into the trunk this evening unless there
are strong objections.
We intend to let this soak for a few days then bring it over to the 1.5 series
(after the 1.5.0 release).
-- Josh
On Mar 15, 2010, at 9:26 AM, Josh Hursey wrote:
(Updated RFC, per offline discussion)
WHAT: Merge a tmp branch for fault recovery development into the OMPI
trunk
WHY: Bring over work done by Josh and Ralph to extend OMPI's fault
recovery capabilities
WHERE: Impacts a number of ORTE files and an ORTE ErrMgr framework
TIMEOUT: Barring ob
Wesley,
Thanks for catching that oversight. Below are the MCA parameters that you
should need at the moment:
#
# Use the C/R Process Migration Recovery Supervisor
recos_base_enable=1
# Only use the 'rsh' launcher, other launchers will be supported later
plm=rsh
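
For anyone testing, here is a minimal sketch of two ways the parameters above could be supplied. It assumes the usual Open MPI conventions: a `name=value` MCA parameter file (the per-user default is normally `$HOME/.openmpi/mca-params.conf`) and the `--mca <name> <value>` option to mpirun. `PARAM_DIR`, the process count, and `./my_app` are hypothetical placeholders, not part of the RFC.

```shell
# Sketch only: write the parameters from the message above to an MCA
# parameter file, and build the equivalent mpirun command line.
# PARAM_DIR, "-np 4", and ./my_app are hypothetical placeholders.
PARAM_DIR="${PARAM_DIR:-$(mktemp -d)}"
PARAM_FILE="$PARAM_DIR/mca-params.conf"

cat > "$PARAM_FILE" <<'EOF'
# Use the C/R Process Migration Recovery Supervisor
recos_base_enable=1
# Only use the 'rsh' launcher, other launchers will be supported later
plm=rsh
EOF

# Equivalent command-line form; printed here rather than executed.
MCA_ARGS="--mca recos_base_enable 1 --mca plm rsh"
echo "mpirun $MCA_ARGS -np 4 ./my_app"
```

The file form is convenient for repeated test runs; the command-line form makes the settings visible in each invocation. To have Open MPI pick the file up by default, it would be copied to the per-user location mentioned above.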
Josh,
You mentioned some MCA parameters that you would include in the email, but I
don't see those parameters anywhere. Could you please put those in here to
make testing easier for people.
Wesley
On Wed, Mar 10, 2010 at 1:26 PM, Josh Hursey wrote:
Yesterday evening George, Thomas and I discussed some of their concerns about
this RFC at the MPI Forum meeting. After the discussion, we seemed to be in
agreement that the RecoS framework is a good idea and the concepts and fixes in
this RFC should move forward with a couple of notes:
- They
Sounds good to me.
For those casually following this RFC let me summarize its current state.
Josh and George (and anyone else attending the forum who wishes to
participate) will meet sometime at the next MPI Forum meeting (March 8-10). I will
post any relevant notes from this meeting back to t
On Feb 26, 2010, at 01:50 , Josh Hursey wrote:
Any of those options are fine with me. I was thinking that if you wanted to
talk sooner, we might be able to help explain our intentions with this
framework a bit better. I figure that the framework interface will change a bit
as we all advance and incorporate our various techniques into it. I t
If Josh is going to be at the forum, perhaps you folks could chat there?
Might as well take advantage of being colocated, if possible.
Otherwise, I'm available pretty much any time. I can't contribute much about
the MPI recovery issues, but can contribute to the RTE issues if that helps.
On Thu,
Josh,
Next week is a little bit too early, as we will need some time to figure out how
to integrate with this new framework, and to what extent our code and
requirements fit into it. Then the week after is the MPI Forum. How about
Thursday, 11 March?
Thanks,
george.
On Feb 25, 2010, at 12:46
Just to add to Josh's comment: I am working now on recovering from HNP
failure as well. Should have that in a month or so.
On Thu, Feb 25, 2010 at 10:46 AM, Josh Hursey wrote:
>
> On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> >
> > On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
> >
I believe you are thinking parallel to what Josh and I have been doing, and
slightly different to the UTK approach. The "orcm" method follows what you
describe: we maintain operation on the current remaining nodes, see if we
can use another new node to replace the failed one, and redistribute the
a
On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
>
>> Hum... I'm really afraid about this. I understand your choice since it is
>> really a good solution for fail/stop/restart behaviour, but looking from the
>> fail/recovery side, can y
Hi George,
>> Hum... I'm really afraid about this. I understand your choice since it is
>> really a good solution for fail/stop/restart behaviour, but looking from the
>> fail/recovery side, can you envision some alternative for the orted's
>> reconfiguration on the fly?
>
> I don't see why th
On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
> Hum... I'm really afraid about this. I understand your choice since it is
> really a good solution for fail/stop/restart behaviour, but looking from the
> fail/recovery side, can you envision some alternative for the orted's
> reconfiguratio
Hi Ralph and Josh,
>>> Regarding to the schema represented by the picture, I didn't understand the
>>> RecoS' behaviour in a node failure situation.
>>>
>>> In this case, will mpirun consider the daemon failure as a normal proc
>>> failure? If it is correct, should mpirun update the global proc
On Feb 25, 2010, at 4:38 AM, Ralph Castain wrote:
>
> On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
>
>> Hi Ralph,
>>
>> Very interesting the "composite framework" idea.
>
> Josh is the force behind that idea :-)
It solves a pretty interesting little problem. Its utility will really sh
On Feb 23, 2010, at 3:00 PM, Ralph Castain wrote:
>
> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>
>> Ralph, Josh,
>>
>> We have some comments about the API of the new framework, mostly
>> clarifications needed to better understand how this new framework is
>> supposed to be used. An
On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
> Hi Ralph,
>
> Very interesting the "composite framework" idea.
Josh is the force behind that idea :-)
> Regarding to the schema represented by the picture, I didn't understand the
> RecoS' behaviour in a node failure situation.
>
> In thi
Hi Ralph,
Very interesting the "composite framework" idea. Regarding to the schema
represented by the picture, I didn't understand the RecoS' behaviour in a node
failure situation.
In this case, will mpirun consider the daemon failure as a normal proc failure?
If it is correct, should mpirun u
Hi George et al
I have begun documenting the RecoS operation on the OMPI wiki:
https://svn.open-mpi.org/trac/ompi/wiki/RecoS
I'll continue to work on this over the next few days by adding a section
explaining what was changed outside of the new framework to make it all work.
In addition, I am
On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
Ralph, Josh,
We have some comments about the API of the new framework, mostly clarifications
needed to better understand how this new framework is supposed to be used. And
a request for a deadline extension, to delay the code merge from the RecoS
branch into the trunk by a week.
We have our own
WHAT: Merge a tmp branch for fault recovery development into the OMPI trunk
WHY: Bring over work done by Josh and Ralph to extend OMPI's fault recovery
capabilities
WHERE: Impacts a number of ORTE files and a small number of OMPI files
TIMEOUT: Barring objections and/or requests for delay, the