Re: [OMPI devel] Add child to another parent.

Ralph Castain Fri, 8 Apr 2011 11:12:49 -0400

On Apr 8, 2011, at 9:02 AM, Hugo Meyer wrote:

> Thanks Ralph.
> 
> I found set_lifeline, and with it I think I solved that error, but now I'm 
> dealing with another one.
> 
> [clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number of 
> attempts to create TCP connection has been exceeded.  Can not communicate 
> with peer
> Open MPI Error Report:[32001]: While communicating to proc [[44269,1],1] on 
> node node3, proc [[44269,0],2] on node clus3 encountered an error 
> 'Communication failure':OOB Connection retries exceeded.  Can not communicate 
> with peer
> 
> I think this occurs because the daemon [[44269,0],2] doesn't know on which 
> port and address the proc has been restored. I will look for a way to 
> update this information.


When the proc restarts, it calls orte_routed.init_routes. If you look in routed 
cm, you should see a call to "register_sync" - this is where the proc sends a 
message to the local daemon, allowing it to "learn" the port/address where the 
proc resides.
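
As a rough illustration of that idea only (this is not the ORTE code; every 
type and function name below is hypothetical), the daemon can be pictured as 
keeping a small contact table keyed by process name. Each sync message from a 
local proc records or overwrites the address and port for that name, so a 
restarted proc that comes up on a new port replaces its stale entry:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PEERS 16
#define ADDR_LEN  64

/* Hypothetical process name (job, vpid); a simplification of names such
 * as [[44269,1],1] seen in the logs. */
typedef struct { unsigned job; unsigned vpid; } proc_name_t;

/* One row of the daemon's (hypothetical) contact table. */
typedef struct {
    proc_name_t name;
    char addr[ADDR_LEN];
    int port;
    int in_use;
} contact_t;

typedef struct { contact_t rows[MAX_PEERS]; } contact_table_t;

/* Return the row for proc n, or NULL if the daemon has not heard from it. */
contact_t *lookup_contact(contact_table_t *t, proc_name_t n)
{
    for (int i = 0; i < MAX_PEERS; i++) {
        contact_t *c = &t->rows[i];
        if (c->in_use && c->name.job == n.job && c->name.vpid == n.vpid) {
            return c;
        }
    }
    return NULL;
}

/* Handle a sync message from a local proc: record its address and port,
 * overwriting any entry left over from before a restart.
 * Returns 0 on success, -1 if the table is full. */
int handle_sync(contact_table_t *t, proc_name_t n, const char *addr, int port)
{
    assert(t != NULL && addr != NULL);

    contact_t *c = lookup_contact(t, n);
    if (c == NULL) {
        /* First contact from this proc: claim a free row. */
        for (int i = 0; i < MAX_PEERS; i++) {
            if (!t->rows[i].in_use) {
                c = &t->rows[i];
                break;
            }
        }
        if (c == NULL) {
            return -1;
        }
        c->name = n;
        c->in_use = 1;
    }
    snprintf(c->addr, sizeof c->addr, "%s", addr);
    c->port = port;
    return 0;
}
```

In this sketch, a zero-initialized contact_table_t starts empty. Calling 
handle_sync a second time for the same name simply replaces the old address 
and port, which is the behaviour the restarted proc depends on.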


> 
> Best regards.
> 
> Hugo
> 
> 2011/4/6 Ralph Castain <r...@open-mpi.org>
> Looks like the lifeline is still pointing to its old daemon instead of being 
> updated to the new one. Look in orte/mca/routed/cm/routed_cm.c - should be 
> something in there that updates the lifeline during restart of a checkpoint.
> 
> 
> On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
> 
>> Hi all.
>> 
>> I corrected the error with the port. The mistake was that it tried to 
>> restart the process and, because the ports are static, the process was 
>> taking a port where an app was already running.
>> 
>> Initially, the process was running on [[65478,0],1] and then it moved to 
>> [[65478,0],2].
>> 
>> So now I get the socket bound, but I'm getting a communication failure in 
>> [[65478,0],1]. I'm sending my debug output as an attachment (some of it is 
>> in Spanish, but the default Open MPI debug output is still there), where 
>> you can see everything from the moment I kill the process running on clus5 
>> to the moment it is restored on clus3. Then I get a TERMINATED WITHOUT SYNC 
>> for the restarted proc:
>> [clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC 
>> for proc [[65478,1],1] pid 21705
>> 
>> Here is my stdout output after the socket is bound again when the 
>> process restarts.
>> 
>> [1,1]<stdout>:SOCKET BINDED 
>> [1,1]<stdout>:[clus5:19425] App) notify_response: Waiting for final 
>> handshake.
>> [1,1]<stdout>:[clus5:19425] App) update_status: Update checkpoint status 
>> (13, /tmp/radic/1) for [[65478,1],1]
>> [1,0]<stdout>:INICIEI O BROADCAST (6)
>> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,0]<stdout>:INICIEI O BROADCAST
>> [1,3]<stdout>:INICIEI O BROADCAST (6)
>> [1,2]<stdout>:INICIEI O BROADCAST (6)
>> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,3]<stdout>:INICIEI O BROADCAST
>> [1,2]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,2]<stdout>:INICIEI O BROADCAST
>> [1,1]<stdout>:[clus5:19425] [[65478,1],1] errmgr:app: job [65478,0] reported 
>> state COMMUNICATION FAILURE for proc [[65478,0],1] state COMMUNICATION 
>> FAILURE exit_code 1
>> [1,1]<stdout>:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline 
>> [[65478,0],1] lost
>> [1,1]<stdout>:[[65478,1],1] assigned port 31256
>> 
>> Any help on how to solve this error, or how to interpret it will be greatly 
>> appreciated.
>> 
>> Best regards.
>> 
>> Hugo
>> 
>> 2011/4/5 Hugo Meyer <meyer.h...@gmail.com>
>> Hello Ralph and @ll.
>> 
>> Ralph, by following your recommendations I've already restarted the process 
>> on another node from its checkpoint. But now I'm having a small problem with 
>> oob_tcp. Here is the output:
>> 
>> odls_base_default_fns:SETEANDO BLCR CONTEXT
>> CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
>> ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2] 
>> [1,1]<stdout>:INICIEI O BROADCAST (2)
>> [1,1]<stdout>:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
>> [1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: creating listen socket
>> [1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: unable to create IPv4 listen 
>> socket: Unable to open a TCP socket for out-of-band communications
>> [1,1]<stdout>:[clus5:13374] App) notify_response: Waiting for final 
>> handshake.
>> [1,1]<stdout>:[clus5:13374] App) update_status: Update checkpoint status 
>> (13, /tmp/radic/1) for [[34224,1],1]
>> [1,0]<stdout>:INICIEI O BROADCAST (6)
>> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,0]<stdout>:INICIEI O BROADCAST
>> [1,3]<stdout>:INICIEI O BROADCAST (6)
>> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,3]<stdout>:INICIEI O BROADCAST
>> [1,1]<stdout>:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0] reported 
>> state COMMUNICATION FAILURE for proc [[34224,0],1] state COMMUNICATION 
>> FAILURE exit_code 1
>> [1,1]<stdout>:[clus5:13374] [[34224,1],1] routed:cm: Connection to lifeline 
>> [[34224,0],1] lost
>> 
>> I think this error occurs because the process wants to create the 
>> socket using the port that was previously assigned to it. So, if I want to 
>> restart it using another port, how will the other daemons and processes 
>> find out about this? Is this a good choice?
>> 
>> Best regards.
>> 
>> Hugo Meyer
>> 
>> 2011/3/31 Hugo Meyer <meyer.h...@gmail.com>
>> OK Ralph. 
>> Thanks a lot, I will resend this message with a new subject.
>> 
>> Best Regards.
>> 
>> Hugo
>> 
>> 
>> 2011/3/31 Ralph Castain <r...@open-mpi.org>
>> Sorry - should have included the devel list when I sent this.
>> 
>> 
>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>> 
>>> I'm not the expert on this area - Josh is, so I'll defer to him. I did take 
>>> a quick glance at the sstore framework, though, and it looks like there are 
>>> some params you could set that might help.
>>> 
>>> "ompi_info --param sstore all"
>>> 
>>> should tell you what's available. Also, note that Josh created a man page 
>>> to explain how sstore works. It's in section 7, looks like "man 
>>> orte_sstore" should get it.
>>> 
>>> 
>>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>> 
>>>> Hello again.
>>>> 
>>>> I'm working on the launch code to handle my checkpoints, but I'm a little 
>>>> stuck on how to set the path to my checkpoint and the executable 
>>>> (ompi_blcr_context.PID). I took a look at the code in 
>>>> odls_base_default_fns.c and this piece of code caught my attention:
>>>> 
>>>> #if OPAL_ENABLE_FT_CR == 1
>>>>             /*
>>>>              * OPAL CRS components need the opportunity to take action before a process
>>>>              * is forked.
>>>>              * Needs access to:
>>>>              *   - Environment
>>>>              *   - Rank/ORTE Name
>>>>              *   - Binary to exec
>>>>              */
>>>>             if( NULL != opal_crs.crs_prelaunch ) {
>>>>                 if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>>>                                                                  orte_sstore_base_prelaunch_location,
>>>>                                                                  &(app->app),
>>>>                                                                  &(app->cwd),
>>>>                                                                  &(app->argv),
>>>>                                                                  &(app->env) ) ) ) {
>>>>                     ORTE_ERROR_LOG(rc);
>>>>                     goto CLEANUP;
>>>>                 }
>>>>             }
>>>> #endif
>>>> 
>>>> But I couldn't find out how to set orte_sstore_base_prelaunch_location; I 
>>>> know that it is initially set in sstore_base_open. For example, as 
>>>> I'm transferring my checkpoint from one node to another, I store the 
>>>> checkpoint that has to be restored in /tmp/1/ and it has a name like 
>>>> ompi_blcr_context.PID.
>>>> 
>>>> Is there any function that I missed that allows me to do this? I'm 
>>>> asking because I do not want to change the signature of the functions 
>>>> to pass the details of the checkpoint and the PID.
>>>> 
>>>> Best Regards.
>>>> 
>>>> Hugo Meyer
>>>> 
>>>> 2011/3/30 Hugo Meyer <meyer.h...@gmail.com>
>>>> Thanks Ralph.
>>>> I have finished point (a), and it's working now. Next I have to work on 
>>>> relaunching from my checkpoint, as you said.
>>>> 
>>>> Best regards.
>>>> 
>>>> Hugo Meyer
>>>> 
>>>> 
>>>> 2011/3/29 Ralph Castain <r...@open-mpi.org>
>>>> The resilient mapper -only- works on procs being restarted - it cannot map 
>>>> a job for its initial launch. You shouldn't set any rmaps flag and things 
>>>> will work correctly - the default round-robin mapper will map the initial 
>>>> launch, and then the resilient mapper will handle restarts.
>>>> 
>>>> 
>>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>> 
>>>>> Ralph.
>>>>> 
>>>>> I'm having a problem when I try to select the resilient rmaps component:
>>>>> 
>>>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile 
>>>>> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca 
>>>>> plm rsh -mca routed cm ./coll 6 10 2>out.txt 
>>>>> 
>>>>> I get this error:
>>>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for 
>>>>> nodes
>>>>> --------------------------------------------------------------------------
>>>>> Your job failed to map. Either no mapper was available, or none
>>>>> of the available mappers was able to perform the requested
>>>>> mapping operation. This can happen if you request a map type
>>>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App. 
>>>>> Process state updated for process NULL
>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state 
>>>>> NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state 
>>>>> NEVER LAUNCHED
>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] 
>>>>> with status 1
>>>>> 
>>>>> Is there a flag that I'm not turning on, or a component that I should 
>>>>> have selected?
>>>>> 
>>>>> Thanks again.
>>>>> 
>>>>> Hugo Meyer
>>>>> 
>>>>> 
>>>>> 2011/3/26 Hugo Meyer <meyer.h...@gmail.com>
>>>>> Ok Ralph.
>>>>> 
>>>>> Thanks a lot for your help. I will do as you said and then let you know 
>>>>> how it goes.
>>>>> 
>>>>> Best Regards.
>>>>> 
>>>>> Hugo Meyer
>>>>> 
>>>>> 
>>>>> 2011/3/25 Ralph Castain <r...@open-mpi.org>
>>>>> 
>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>> 
>>>>>> From what you've described before, I suspect all you'll need to do is 
>>>>>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) 
>>>>>> checks to see if a process in the launch message is being relocated (the 
>>>>>> construct_child_list code does that already), and then (b) sends the 
>>>>>> required info to all local child processes so they can take appropriate 
>>>>>> action.
>>>>>> 
>>>>>> Failure detection, re-launch, etc. have all been taken care of for you.
>>>>>> 
>>>>>> 
>>>>>> I looked at the code that you mentioned and I realized that I have two 
>>>>>> possible options, which I'm going to share with you to get your opinion.
>>>>>> 
>>>>>> First of all, let me describe the current state of the 
>>>>>> implementation. I'm working on a fault-tolerant system that uses 
>>>>>> uncoordinated checkpointing, so I take checkpoints of all my processes 
>>>>>> at different times and store them on the machine where they reside, 
>>>>>> but I also send these checkpoints to another node (let's call it the 
>>>>>> protector), so if a node fails, its processes should be restarted on 
>>>>>> the protector that holds their checkpoints.
>>>>>> 
>>>>>> Right now I'm detecting the failure of a process and I know where this 
>>>>>> process should be restarted, and I also have the checkpoint on the 
>>>>>> protector. And I also have the child information, of course.
>>>>>> 
>>>>>> So, my options are:
>>>>>> First Option
>>>>>> 
>>>>>> I detect the failure, and then I use 
>>>>>> orte_errmgr_hnp_base_global_update_state() with some modifications and 
>>>>>> hnp_relocate, but changing the spawn into a restart from a 
>>>>>> checkpoint. I suppose that this way the migration of the process to 
>>>>>> another node will be recorded and everyone will know about it, because 
>>>>>> it is the HNP that does this (is this OK?).
>>>>> 
>>>>> This is the option I would use. The other one is much, much more work. In 
>>>>> this option, you only have to:
>>>>> 
>>>>> (a) modify the mapper so you can specify the location of the proc being 
>>>>> restarted. The resilient mapper module will be handling the restart - if 
>>>>> you look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the 
>>>>> code doing the "replacement" and modify accordingly.
>>>>> 
>>>>> (b) add any required info about your checkpoint to the launch message. 
>>>>> This gets created in orte/mca/odls/base/odls_base_default_fns.c, the 
>>>>> "get_add_procs_data" function (at the top of the file).
>>>>> 
>>>>> (c) modify the launch code to handle your checkpoint, if required - see 
>>>>> the file in (b), the "construct_child" and "launch" functions.
>>>>> 
>>>>> HTH
>>>>> Ralph
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Second Option
>>>>>> 
>>>>>> Modify one of the spawn variations (probably remote_spawn from rsh) 
>>>>>> in the PLM framework and then use orted_comm to command a 
>>>>>> remote_spawn on the protector, but here I don't know how to update the 
>>>>>> info so everyone knows about the change, or how this is managed.
>>>>>> 
>>>>>> I might be very wrong in what I said, my apologies if so.
>>>>>> 
>>>>>> Thanks a lot for all the help.
>>>>>> 
>>>>>> Best regards.
>>>>>> 
>>>>>> Hugo Meyer
>>>>>> 
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> 
>> <out>
> 
>
