Re: [OMPI devel] Add child to another parent.

2011-04-13 Thread Hugo Meyer
> When the proc restarts, it calls orte_routed.init_routes. If you look in routed cm, you should see a call to "register_sync" - this is where the proc sends a message to the local daemon, allowing it to "learn" the port/address where the proc resides.
I've done this. I had a problem because when i

Re: [OMPI devel] Add child to another parent.

2011-04-08 Thread Ralph Castain
On Apr 8, 2011, at 9:02 AM, Hugo Meyer wrote:
> Thanks Ralph.
> I found a set_lifeline with that i think i solve that error, but, now i'm dealing with another.
> [clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number of attempts to create TCP connection has been ex

Re: [OMPI devel] Add child to another parent.

2011-04-08 Thread Hugo Meyer
Thanks Ralph. I found a set_lifeline, and with that I think I solved that error, but now I'm dealing with another. [clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number of attempts to create TCP connection has been exceeded. Can not communicate with peer Open MPI Error Report:[32

Re: [OMPI devel] Add child to another parent.

2011-04-06 Thread Ralph Castain
Looks like the lifeline is still pointing to its old daemon instead of being updated to the new one. Look in orte/mca/routed/cm/routed_cm.c - should be something in there that updates the lifeline during restart of a checkpoint.
On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
> Hi all.
> I co

Re: [OMPI devel] Add child to another parent.

2011-04-06 Thread Hugo Meyer
Hi all. I corrected the error with the port. The problem was that it tried to restart the process and, because the ports are static, the process was taking a port where an app was already running. Initially, the process was running on [[65478,0],1] and then it moved to [[65478,0],2]. So now I get t

Re: [OMPI devel] Add child to another parent.

2011-04-05 Thread Hugo Meyer
Hello Ralph and @ll. Ralph, by following your recommendations I've already restarted the process on another node from its checkpoint. But now I'm having a small problem with the oob_tcp. Here is the output: odls_base_default_fns:SETEANDO BLCR CONTEXT CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374

Re: [OMPI devel] Add child to another parent.

2011-03-31 Thread Hugo Meyer
Ok Ralph. Thanks a lot, I will resend this message with a new subject. Best Regards. Hugo
2011/3/31 Ralph Castain
> Sorry - should have included the devel list when I sent this.
> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
> > I'm not the expert on this area - Josh is, so I'll defer

Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Ralph Castain
Sorry - should have included the devel list when I sent this.
On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
> I'm not the expert on this area - Josh is, so I'll defer to him. I did take a quick glance at the sstore framework, though, and it looks like there are some params you could se

Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Hugo Meyer
Hello again. I'm working on the launch code to handle my checkpoints, but I'm a little stuck on how to set the path to my checkpoint and the executable (ompi_blcr_context.PID). I took a look at the code in odls_base_default_fns.c and this piece of code caught my attention: #if OPAL_ENABLE_FT_CR ==

Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Hugo Meyer
Thanks Ralph. I have finished point (a), and now it's working. Next I have to work on relaunching from my checkpoint as you said. Best regards. Hugo Meyer
2011/3/29 Ralph Castain
> The resilient mapper -only- works on procs being restarted - it cannot map a job for its initial launch. You sho

Re: [OMPI devel] Add child to another parent.

2011-03-29 Thread Ralph Castain
The resilient mapper -only- works on procs being restarted - it cannot map a job for its initial launch. You shouldn't set any rmaps flag and things will work correctly - the default round-robin mapper will map the initial launch, and then the resilient mapper will handle restarts. On Mar 29,

Re: [OMPI devel] Add child to another parent.

2011-03-29 Thread Hugo Meyer
Ralph. I'm having a problem when I try to select the resilient rmaps component: /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt I get this error: [c

Re: [OMPI devel] Add child to another parent.

2011-03-26 Thread Hugo Meyer
Ok Ralph. Thanks a lot for your help, I will do as you said and then let you know how it goes. Best Regards. Hugo Meyer
2011/3/25 Ralph Castain
> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
> > From what you've described before, I suspect all you'll need to do is add some code in ort

Re: [OMPI devel] Add child to another parent.

2011-03-25 Thread Ralph Castain
On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
> From what you've described before, I suspect all you'll need to do is add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to see if a process in the launch message is being relocated (the construct_child_list code

Re: [OMPI devel] Add child to another parent.

2011-03-25 Thread Hugo Meyer
> From what you've described before, I suspect all you'll need to do is add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to see if a process in the launch message is being relocated (the construct_child_list code does that already), and then (b) sends the requir

Re: [OMPI devel] Add child to another parent.

2011-03-24 Thread Ralph Castain
On Mar 24, 2011, at 3:57 PM, Hugo Meyer wrote:
> 2011/3/24 Ralph Castain
> You really don't want to do it that way - you'll create a major confusion in mpirun and the other daemons about who is where. Have you looked at the code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?

Re: [OMPI devel] Add child to another parent.

2011-03-24 Thread Hugo Meyer
2011/3/24 Ralph Castain
> You really don't want to do it that way - you'll create a major confusion in mpirun and the other daemons about who is where. Have you looked at the code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
I did not look at that, but I will do it right no

Re: [OMPI devel] Add child to another parent.

2011-03-24 Thread Ralph Castain
You really don't want to do it that way - you'll create a major confusion in mpirun and the other daemons about who is where. Have you looked at the code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following? The ability to relocate a failed child process is already in the trunk - it onl

[OMPI devel] Add child to another parent.

2011-03-24 Thread Hugo Meyer
Hello @ll. I'm trying to restart a child that has failed. Right now I'm catching the failed child in the errmgr, then packing the child and sending it to another node that has to "adopt" it. Is there any way to do this with the current implementation? Something like add_child. Because then I will hav