Hi George et al
I have begun documenting the RecoS operation on the OMPI wiki:
https://svn.open-mpi.org/trac/ompi/wiki/RecoS
I'll continue to work on this over the next few days by adding a section
explaining what was changed outside of the new framework to make it all work.
In addition, I am
Hi Ralph,
The "composite framework" idea is very interesting. Regarding the schema
represented by the picture, I didn't understand the RecoS' behaviour in a node
failure situation.
In this case, will mpirun consider the daemon failure as a normal proc failure?
If it is correct, should mpirun u
Thanks a lot! I got it. Could you point me to some more material to help me
better understand the following functions:
(1):/ompi/mca/pml/ob1/pml_ob1.c/mca_pml_ob1_add_procs
(2):/ompi/mca/bml/r2/bml_r2.c/mca_bml_r2_add_procs
(3):/ompi/mca/btl/tcp/btl_tcp.c/mca_btl_tcp_add_procs
especially the se
On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
> Hi Ralph,
>
> The "composite framework" idea is very interesting.
Josh is the force behind that idea :-)
> Regarding the schema represented by the picture, I didn't understand the
> RecoS' behaviour in a node failure situation.
>
> In thi
On Feb 25, 2010, at 7:14 AM, hu yaohui wrote:
> Thanks a lot! I got it. Could you point me to some more material to help me
> better understand the following functions:
> (1):/ompi/mca/pml/ob1/pml_ob1.c/mca_pml_ob1_add_procs
This is just the OB1 function to add new peer processes. It's call
On Feb 23, 2010, at 3:00 PM, Ralph Castain wrote:
>
> On Feb 23, 2010, at 3:32 PM, George Bosilca wrote:
>
>> Ralph, Josh,
>>
>> We have some comments about the API of the new framework, mostly
>> clarifications needed to better understand how this new framework is
>> supposed to be used. An
On Feb 25, 2010, at 4:38 AM, Ralph Castain wrote:
>
> On Feb 25, 2010, at 1:41 AM, Leonardo Fialho wrote:
>
>> Hi Ralph,
>>
>> The "composite framework" idea is very interesting.
>
> Josh is the force behind that idea :-)
It solves a pretty interesting little problem. Its utility will really sh
Hi Ralph and Josh,
>>> Regarding the schema represented by the picture, I didn't understand the
>>> RecoS' behaviour in a node failure situation.
>>>
>>> In this case, will mpirun consider the daemon failure as a normal proc
>>> failure? If it is correct, should mpirun update the global proc
On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
> Hum... I'm really afraid about this. I understand your choice since it is
> really a good solution for fail/stop/restart behaviour, but looking from the
> fail/recovery side, can you envision some alternative for the orted's
> reconfiguratio
Hi George,
>> Hum... I'm really afraid about this. I understand your choice since it is
>> really a good solution for fail/stop/restart behaviour, but looking from the
>> fail/recovery side, can you envision some alternative for the orted's
>> reconfiguration on the fly?
>
> I don't see why th
Ralph,
We'd like this to be able to support attaching a debugger to the application.
Would it be difficult to provide? We don't need the information all at once,
each PID could be sent as the process launches (as long as the XML is correctly
formatted) if that makes it any easier.
Greg
On Feb
On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
>
>> Hum... I'm really afraid about this. I understand your choice since it is
>> really a good solution for fail/stop/restart behaviour, but looking from the
>> fail/recovery side, can y
Have you looked at orte-ps? It contains all the information you'll need to
attach a debugger to an already running application.
Ashley,
On 25 Feb 2010, at 17:43, Greg Watson wrote:
> Ralph,
>
> We'd like this to be able to support attaching a debugger to the application.
> Would it be diffic
WHAT: Bump minimum required versions of GNU autotools up to modern versions. I
suggest the following, but could be talked down a version or two:
Autoconf: 2.65
Automake: 1.11.1
Libtool: 2.2.6b
WHY: Stop carrying patches and workarounds for old versions.
WHERE: autogen.sh, make
I think our last set of minimums was based on being able to use RHEL4 out of
the box. Updating to whatever ships with RHEL5 probably makes sense, but I
think that still leaves you at a LT 1.5.x release. Being higher than that
requires new Autotools, which seems like asking for trouble.
Brian
I believe you are thinking parallel to what Josh and I have been doing, and
slightly different to the UTK approach. The "orcm" method follows what you
describe: we maintain operation on the current remaining nodes, see if we
can use another new node to replace the failed one, and redistribute the
a
Just to add to Josh's comment: I am working now on recovering from HNP
failure as well. Should have that in a month or so.
On Thu, Feb 25, 2010 at 10:46 AM, Josh Hursey wrote:
>
> On Feb 25, 2010, at 8:32 AM, George Bosilca wrote:
>
> >
> > On Feb 25, 2010, at 11:16 , Leonardo Fialho wrote:
> >
Easy to do. I'll dump all the pids at the same time when the launch
completes - effectively, it will be at the same point used by other
debuggers to attach.
Have it for you in the trunk this weekend. Can you suggest an xml format you
would like? Otherwise, I'll just use the current proc output (us
Josh,
Next week is a little too early, as we will need some time to figure out how to
integrate with this new framework and to what extent our code and requirements
fit into it. The week after that is the MPI Forum. How about Thursday, 11 March?
Thanks,
george.
On Feb 25, 2010, at 12:46
If Josh is going to be at the forum, perhaps you folks could chat there?
Might as well take advantage of being colocated, if possible.
Otherwise, I'm available pretty much any time. I can't contribute much about
the MPI recovery issues, but can contribute to the RTE issues if that helps.
On Thu,