Re: [OMPI devel] Add child to another parent.

2011-04-13 Thread Hugo Meyer
When the proc restarts, it calls orte_routed.init_routes. If you look in routed cm, you should see a call to "register_sync" - this is where the proc sends a message to the local daemon, allowing it to "learn" the port/address where the proc resides. I've done this. I had a problem because when I

Re: [OMPI devel] Add child to another parent.

2011-04-08 Thread Ralph Castain
On Apr 8, 2011, at 9:02 AM, Hugo Meyer wrote:
> Thanks Ralph.
>
> I found a set_lifeline, and with that I think I solved that error, but now I'm
> dealing with another.
>
> [clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number of
> attempts to create TCP connection has been ex

Re: [OMPI devel] Add child to another parent.

2011-04-08 Thread Hugo Meyer
Thanks Ralph. I found a set_lifeline, and with that I think I solved that error, but now I'm dealing with another. [clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number of attempts to create TCP connection has been exceeded. Can not communicate with peer Open MPI Error Report:[32

Re: [OMPI devel] Add child to another parent.

2011-04-06 Thread Ralph Castain
Looks like the lifeline is still pointing to its old daemon instead of being updated to the new one. Look in orte/mca/routed/cm/routed_cm.c - should be something in there that updates the lifeline during restart of a checkpoint.
On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
> Hi all.
>
> I co

Re: [OMPI devel] Add child to another parent.

2011-04-06 Thread Hugo Meyer
Hi all. I corrected the error with the port. The mistake arose because the ports are static: when the process was started again, it took a port on which an app was already running. Initially, the process was running on [[65478,0],1] and then it moved to [[65478,0],2]. So now I get t
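The static-port collision described in this message can be reproduced outside Open MPI. The following is a minimal Python sketch, not code from the thread or from ORTE: `try_bind` is a hypothetical helper, and the loopback address and OS-assigned port are assumptions chosen so the example runs anywhere. It shows that a second process configured with the same fixed port cannot bind while the first listener is still alive.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to listen on a fixed TCP port.

    Returns (socket, None) on success or (None, errno_code) on failure.
    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


# Stand-in for the app that is already running on the node:
# let the OS pick a free port, then treat that number as the "static" port.
first_sock, first_err = try_bind(0)
static_port = first_sock.getsockname()[1]

# Stand-in for the restarted process, configured with the same static port.
second_sock, second_err = try_bind(static_port)

if second_sock is None and second_err == errno.EADDRINUSE:
    print(f"port {static_port} is already in use; the restarted process needs another port")

first_sock.close()
```

In this sketch the conflict goes away only if the restarted process is given a different port or the first listener has exited, which matches the symptom reported above.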

Re: [OMPI devel] Add child to another parent.

2011-04-05 Thread Hugo Meyer
Hello Ralph and all. Ralph, by following your recommendations I have already restarted the process on another node from its checkpoint. But now I'm having a small problem with the oob_tcp. Here is the output: odls_base_default_fns:SETEANDO BLCR CONTEXT CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374

Re: [OMPI devel] Add child to another parent.

2011-03-31 Thread Hugo Meyer
Ok Ralph. Thanks a lot, I will resend this message with a new subject.
Best Regards.
Hugo
2011/3/31 Ralph Castain
> Sorry - should have included the devel list when I sent this.
>
>
> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>
> I'm not the expert on this area - Josh is, so I'll defer

Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Ralph Castain
Sorry - should have included the devel list when I sent this.
On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
> I'm not the expert on this area - Josh is, so I'll defer to him. I did take a
> quick glance at the sstore framework, though, and it looks like there are
> some params you could se

Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Hugo Meyer
Hello again. I'm working on the launch code to handle my checkpoints, but I'm a little stuck on how to set the path to my checkpoint and the executable (ompi_blcr_context.PID). I took a look at the code in odls_base_default_fns.c and this piece of code caught my attention: #if OPAL_ENABLE_FT_CR ==

Re: [OMPI devel] Add child to another parent.

2011-03-30 Thread Hugo Meyer
Thanks Ralph. I have finished point (a), and it's working now. Next I have to work on relaunching from my checkpoint, as you said.
Best regards.
Hugo Meyer
2011/3/29 Ralph Castain
> The resilient mapper -only- works on procs being restarted - it cannot map
> a job for its initial launch. You sho

Re: [OMPI devel] Add child to another parent.

2011-03-29 Thread Ralph Castain
The resilient mapper -only- works on procs being restarted - it cannot map a job for its initial launch. You shouldn't set any rmaps flag and things will work correctly - the default round-robin mapper will map the initial launch, and then the resilient mapper will handle restarts. On Mar 29,
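The division of labor described in this message (one mapper for the initial launch, another only for restarts) can be pictured with a toy model. The Python sketch below is an illustration under stated assumptions, not Open MPI's rmaps code: the function names, the node names, and the placement policy for displaced ranks are all hypothetical.

```python
def map_round_robin_bynode(num_procs, nodes):
    """Initial launch: place rank i on nodes[i % len(nodes)] (by-node round robin)."""
    return {rank: nodes[rank % len(nodes)] for rank in range(num_procs)}


def remap_failed(mapping, nodes, failed_node):
    """Restart only: move ranks off failed_node and leave every other rank in place.

    The policy for choosing a new node here is an assumption for illustration;
    it is not the policy of Open MPI's resilient mapper.
    """
    survivors = [node for node in nodes if node != failed_node]
    if not survivors:
        raise RuntimeError("no surviving nodes to restart on")
    new_mapping = dict(mapping)
    displaced = [rank for rank in sorted(mapping) if mapping[rank] == failed_node]
    for i, rank in enumerate(displaced):
        new_mapping[rank] = survivors[i % len(survivors)]
    return new_mapping


nodes = ["node1", "node2", "node3"]  # hypothetical host names

# Initial launch is handled by the default mapper.
initial = map_round_robin_bynode(4, nodes)

# A restart after node2 fails touches only the ranks that lived there.
restarted = remap_failed(initial, nodes, "node2")

print(initial)
print(restarted)
```

The point of the sketch is that the restart step needs an existing mapping as input, which is why it cannot stand in for the initial launch.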

Re: [OMPI devel] Add child to another parent.

2011-03-29 Thread Hugo Meyer
Ralph. I'm having a problem when I try to select the resilient rmaps component:
/home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt
I get this error: [c

Re: [OMPI devel] Add child to another parent.

2011-03-26 Thread Hugo Meyer
Ok Ralph. Thanks a lot for your help, I will do as you said and then let you know how it goes.
Best Regards.
Hugo Meyer
2011/3/25 Ralph Castain
>
> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>
> From what you've described before, I suspect all you'll need to do is add
>> some code in ort

Re: [OMPI devel] Add child to another parent.

2011-03-25 Thread Ralph Castain
On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
> From what you've described before, I suspect all you'll need to do is add
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> see if a process in the launch message is being relocated (the
> construct_child_list code

Re: [OMPI devel] Add child to another parent.

2011-03-25 Thread Hugo Meyer
>
> From what you've described before, I suspect all you'll need to do is add
> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> see if a process in the launch message is being relocated (the
> construct_child_list code does that already), and then (b) sends the
> requir

Re: [OMPI devel] Add child to another parent.

2011-03-24 Thread Ralph Castain
On Mar 24, 2011, at 3:57 PM, Hugo Meyer wrote:
> 2011/3/24 Ralph Castain
> You really don't want to do it that way - you'll create a major confusion in
> mpirun and the other daemons about who is where. Have you looked at the code
> in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?

Re: [OMPI devel] Add child to another parent.

2011-03-24 Thread Hugo Meyer
2011/3/24 Ralph Castain
> You really don't want to do it that way - you'll create a major confusion
> in mpirun and the other daemons about who is where. Have you looked at the
> code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
>
I did not look at that, but I will do it right no

Re: [OMPI devel] Add child to another parent.

2011-03-24 Thread Ralph Castain
You really don't want to do it that way - you'll create a major confusion in mpirun and the other daemons about who is where. Have you looked at the code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following? The ability to relocate a failed child process is already in the trunk - it onl