When the proc restarts, it calls orte_routed.init_routes. If you look in
routed cm, you should see a call to "register_sync" - this is where the proc
sends a message to the local daemon, allowing it to "learn" the port/address
where the proc resides.
I've done this. I had a problem because when I
On Apr 8, 2011, at 9:02 AM, Hugo Meyer wrote:
> Thanks Ralph.
>
> I found a set_lifeline call, and with that I think I solved that error, but
> now I'm dealing with another.
>
> [clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number of
> attempts to create TCP connection has been ex
Thanks Ralph.
I found a set_lifeline call, and with that I think I solved that error, but
now I'm dealing with another.
[clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number
of attempts to create TCP connection has been exceeded. Can not communicate
with peer
Open MPI Error Report:[32
Looks like the lifeline is still pointing to its old daemon instead of being
updated to the new one. Look in orte/mca/routed/cm/routed_cm.c - should be
something in there that updates the lifeline during restart of a checkpoint.
On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
> Hi all.
>
> I co
Hi all.
I corrected the error with the port. The mistake occurred because, when the
process was started again with static ports, it took a port where an app was
already running.
Initially, the process was running on [[65478,0],1] and then it moved to
[[65478,0],2].
So now I get t
Hello Ralph and @ll.
Ralph, by following your recommendations I've already restarted the process on
another node from its checkpoint. But now I'm having a small problem with
the oob_tcp. Here is the output:
odls_base_default_fns:SETEANDO BLCR CONTEXT
CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
Ok Ralph.
Thanks a lot, I will resend this message with a new subject.
Best Regards.
Hugo
2011/3/31 Ralph Castain
> Sorry - should have included the devel list when I sent this.
>
>
> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>
> I'm not the expert on this area - Josh is, so I'll defer
Sorry - should have included the devel list when I sent this.
On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
> I'm not the expert on this area - Josh is, so I'll defer to him. I did take a
> quick glance at the sstore framework, though, and it looks like there are
> some params you could se
Hello again.
I'm working on the launch code to handle my checkpoints, but I'm a little
stuck on how to set the path to my checkpoint and the executable
(ompi_blcr_context.PID). I took a look at the code in
odls_base_default_fns.c and this piece of code caught my attention:
#if OPAL_ENABLE_FT_CR ==
Thanks Ralph.
I have finished point (a), and it's working now. Next I have to work on
relaunching from my checkpoint, as you said.
Best regards.
Hugo Meyer
2011/3/29 Ralph Castain
> The resilient mapper -only- works on procs being restarted - it cannot map
> a job for its initial launch. You sho
The resilient mapper -only- works on procs being restarted - it cannot map a
job for its initial launch. You shouldn't set any rmaps flag and things will
work correctly - the default round-robin mapper will map the initial launch,
and then the resilient mapper will handle restarts.
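As a rough sketch of that advice: the command below is the invocation from the original report with only the "-mca rmaps resilient" pair dropped. The path, hostfile and remaining MCA options are copied unchanged from that report; that nothing else needs to change is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Path and arguments copied from the original report; only the
# "-mca rmaps resilient" pair has been dropped (assumed to be the
# only change needed).
MPIRUN=/home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun
ARGS="-v -np 4 --hostfile ../hostfile --bynode -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10"

# Print the command for review instead of running it here.
echo "$MPIRUN $ARGS"

# To actually launch (stderr captured as in the original report):
#   $MPIRUN $ARGS 2>out.txt
```

With no rmaps flag set, the default round-robin mapper should handle the initial launch, as described above.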
On Mar 29,
Ralph.
I'm having a problem when I try to select the resilient rmaps component:
/home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
rsh -mca routed cm ./coll 6 10 2>out.txt
I get this error:
[c
Ok Ralph.
Thanks a lot for your help, I will do as you said and then let you know how
it goes.
Best Regards.
Hugo Meyer
2011/3/25 Ralph Castain
>
> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>
> From what you've described before, I suspect all you'll need to do is add
>> some code in ort
On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
> From what you've described before, I suspect all you'll need to do is add
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> see if a process in the launch message is being relocated (the
> construct_child_list code
>
> From what you've described before, I suspect all you'll need to do is add
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> see if a process in the launch message is being relocated (the
> construct_child_list code does that already), and then (b) sends the
> requir
On Mar 24, 2011, at 3:57 PM, Hugo Meyer wrote:
> 2011/3/24 Ralph Castain
> You really don't want to do it that way - you'll create a major confusion in
> mpirun and the other daemons about who is where. Have you looked at the code
> in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
2011/3/24 Ralph Castain
> You really don't want to do it that way - you'll create a major confusion
> in mpirun and the other daemons about who is where. Have you looked at the
> code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
>
I did not look at that, but I will do it right no
You really don't want to do it that way - you'll create a major confusion in
mpirun and the other daemons about who is where. Have you looked at the code in
orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
The ability to relocate a failed child process is already in the trunk - it
onl
Hello @ll.
I'm trying to restart a child that has failed. Right now I'm catching the
failed child in the errmgr, then packing the child and sending it to another
node, which has to "adopt" it. Is there any way to do this with the current
implementation? Something like add_child. Because the i will hav