When the proc restarts, it calls orte_routed.init_routes. If you look in
routed cm, you should see a call to "register_sync" - this is where the proc
sends a message to the local daemon, allowing it to "learn" the port/address
where the proc resides.
I've done this. I had a problem because when I
On Apr 8, 2011, at 9:02 AM, Hugo Meyer wrote:
> Thanks Ralph.
>
> I found a set_lifeline, and with that I think I solved that error, but now I'm
> dealing with another.
>
> [clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number of
> attempts to create TCP connection has been ex
Thanks Ralph.
I found a set_lifeline, and with that I think I solved that error, but now I'm
dealing with another.
[clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number
of attempts to create TCP connection has been exceeded. Can not communicate
with peer
Open MPI Error Report:[32
Looks like the lifeline is still pointing to its old daemon instead of being
updated to the new one. Look in orte/mca/routed/cm/routed_cm.c - should be
something in there that updates the lifeline during restart of a checkpoint.
On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
> Hi all.
>
> I co
Hi all.
I corrected the error with the port. The mistake happened because it tried to
start the process again and, since the ports are static, the process was taking
a port where an app was already running.
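
The static-port conflict described above can be reproduced in isolation. This is a minimal sketch using plain Python sockets, not Open MPI's oob_tcp code: the helper name `try_bind` and the loopback address are assumptions made for illustration only. It shows that a second bind to a TCP port that already has a listener fails with EADDRINUSE.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind and listen on (host, port).

    Hypothetical helper for illustration only.
    Returns (socket, None) on success or (None, errno_code) on failure.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


if __name__ == "__main__":
    # Port 0 lets the OS pick a free port; this plays the role of the
    # application that is already running.
    first, _ = try_bind(0)
    port = first.getsockname()[1]

    # A restarted process configured with the same static port cannot bind.
    second, err = try_bind(port)
    if err == errno.EADDRINUSE:
        print(f"port {port} is already in use")
    first.close()
```

A restarted process that must reuse a fixed port will hit this failure whenever another listener still holds that port on the target node.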
Initially, the process was running on [[65478,0],1] and then it moved to
[[65478,0],2].
So now I get t
Hello Ralph and all.
Ralph, by following your recommendations I've already restarted the process on
another node from its checkpoint. But now I'm having a small problem with
the oob_tcp. Here is the output:
odls_base_default_fns:SETEANDO BLCR CONTEXT
CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
Ok Ralph.
Thanks a lot, I will resend this message with a new subject.
Best Regards.
Hugo
2011/3/31 Ralph Castain
> Sorry - should have included the devel list when I sent this.
>
>
> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>
> I'm not the expert on this area - Josh is, so I'll defer
Sorry - should have included the devel list when I sent this.
On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
> I'm not the expert on this area - Josh is, so I'll defer to him. I did take a
> quick glance at the sstore framework, though, and it looks like there are
> some params you could se
Hello again.
I'm working on the launch code to handle my checkpoints, but I'm a little
stuck on how to set the path to my checkpoint and the executable
(ompi_blcr_context.PID). I took a look at the code in
odls_base_default_fns.c and this piece of code caught my attention:
#if OPAL_ENABLE_FT_CR ==
Thanks Ralph.
I have finished point (a) and it's working now. Next I have to work on
relaunching from my checkpoint, as you said.
Best regards.
Hugo Meyer
2011/3/29 Ralph Castain
> The resilient mapper -only- works on procs being restarted - it cannot map
> a job for its initial launch. You sho
The resilient mapper -only- works on procs being restarted - it cannot map a
job for its initial launch. You shouldn't set any rmaps flag and things will
work correctly - the default round-robin mapper will map the initial launch,
and then the resilient mapper will handle restarts.
On Mar 29,
Ralph.
I'm having a problem when I try to select the rmaps resilient mapper:
/home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
rsh -mca routed cm ./coll 6 10 2>out.txt
I get this error:
[c
Ok Ralph.
Thanks a lot for your help, I will do as you said and then let you know how
it goes.
Best Regards.
Hugo Meyer
2011/3/25 Ralph Castain
>
> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>
> From what you've described before, I suspect all you'll need to do is add
>> some code in ort
On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
> From what you've described before, I suspect all you'll need to do is add
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> see if a process in the launch message is being relocated (the
> construct_child_list code
>
> From what you've described before, I suspect all you'll need to do is add
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> see if a process in the launch message is being relocated (the
> construct_child_list code does that already), and then (b) sends the
> requir
On Mar 24, 2011, at 3:57 PM, Hugo Meyer wrote:
> 2011/3/24 Ralph Castain
> You really don't want to do it that way - you'll create a major confusion in
> mpirun and the other daemons about who is where. Have you looked at the code
> in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
2011/3/24 Ralph Castain
> You really don't want to do it that way - you'll create a major confusion
> in mpirun and the other daemons about who is where. Have you looked at the
> code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
>
I did not look at that, but I will do it right no
You really don't want to do it that way - you'll create a major confusion in
mpirun and the other daemons about who is where. Have you looked at the code in
orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
The ability to relocate a failed child process is already in the trunk - it
onl