Hello Ralph and all.

Ralph, following your recommendations, I have already restarted the process on another node from its checkpoint. But now I'm having a small problem with oob_tcp. Here is the output:

odls_base_default_fns:SETEANDO BLCR CONTEXT CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2]
[1,1]<stdout>:INICIEI O BROADCAST (2)
[1,1]<stdout>:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
*[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: creating listen socket*
*[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: unable to create IPv4 listen socket: Unable to open a TCP socket for out-of-band communications*
[1,1]<stdout>:[clus5:13374] App) notify_response: Waiting for final handshake.
[1,1]<stdout>:[clus5:13374] App) update_status: Update checkpoint status (13, /tmp/radic/1) for [[34224,1],1]
[1,0]<stdout>:INICIEI O BROADCAST (6)
[1,0]<stdout>:FINALIZEI O BROADCAST (6)
[1,0]<stdout>:INICIEI O BROADCAST
[1,3]<stdout>:INICIEI O BROADCAST (6)
[1,3]<stdout>:FINALIZEI O BROADCAST (6)
[1,3]<stdout>:INICIEI O BROADCAST
*[1,1]<stdout>:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0] reported state COMMUNICATION FAILURE for proc [[34224,0],1] state COMMUNICATION FAILURE exit_code 1*
*[1,1]<stdout>:[clus5:13374] [[34224,1],1] routed:cm: Connection to lifeline [[34224,0],1] lost*

I think this error occurs because the process wants to create the socket using the port that was previously assigned to it. So, if I restart it using another port or something similar, how will the other daemons and processes find out about this? Is this a good choice?

Best regards.

Hugo Meyer

2011/3/31 Hugo Meyer <[email protected]>

> Ok Ralph.
> Thanks a lot, I will resend this message with a new subject.
>
> Best Regards.
>
> Hugo
>
>
> 2011/3/31 Ralph Castain <[email protected]>
>
>> Sorry - should have included the devel list when I sent this.
>>
>>
>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>>
>> I'm not the expert on this area - Josh is, so I'll defer to him. I did
>> take a quick glance at the sstore framework, though, and it looks like there
>> are some params you could set that might help.
>>
>> "ompi_info --param sstore all"
>>
>> should tell you what's available. Also, note that Josh created a man page
>> to explain how sstore works. It's in section 7, looks like "man orte_sstore"
>> should get it.
>>
>>
>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>
>> Hello again.
>>
>> I'm working on the launch code to handle my checkpoints, but I'm a little
>> stuck on how to set the path to my checkpoint and the executable
>> (ompi_blcr_context.PID). I took a look at the code in
>> odls_base_default_fns.c and this piece of code caught my attention:
>>
>> #if OPAL_ENABLE_FT_CR == 1
>>     /*
>>      * OPAL CRS components need the opportunity to take action before a process
>>      * is forked.
>>      * Needs access to:
>>      *   - Environment
>>      *   - Rank/ORTE Name
>>      *   - Binary to exec
>>      */
>>     if( NULL != opal_crs.crs_prelaunch ) {
>>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>                                                          orte_sstore_base_prelaunch_location,
>>                                                          &(app->app),
>>                                                          &(app->cwd),
>>                                                          &(app->argv),
>>                                                          &(app->env) ) ) ) {
>>             ORTE_ERROR_LOG(rc);
>>             goto CLEANUP;
>>         }
>>     }
>> #endif
>>
>>
>> But I didn't find out how to set orte_sstore_base_prelaunch_location; I
>> know that initially this is set in sstore_base_open. For example, as I'm
>> transferring my checkpoint from one node to another, I store the checkpoint
>> that has to be restored in /tmp/1/ and it has a name
>> like ompi_blcr_context.PID.
>>
>> Is there any function that I didn't see that allows me to do this? I'm
>> asking this because I do not want to change the signature of the
>> functions to pass the details of the checkpoint and the PID.
>>
>> Best Regards.
>>
>> Hugo Meyer
>>
>> 2011/3/30 Hugo Meyer <[email protected]>
>>
>>> Thanks Ralph.
>>> I have finished point (a), and it is working now. Next I have to work on
>>> relaunching from my checkpoint, as you said.
>>>
>>> Best regards.
>>>
>>> Hugo Meyer
>>>
>>>
>>> 2011/3/29 Ralph Castain <[email protected]>
>>>
>>>> The resilient mapper -only- works on procs being restarted - it cannot
>>>> map a job for its initial launch. You shouldn't set any rmaps flag and
>>>> things will work correctly - the default round-robin mapper will map the
>>>> initial launch, and then the resilient mapper will handle restarts.
>>>>
>>>>
>>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>>
>>>> Ralph.
>>>>
>>>> I'm having a problem when I try to select the resilient rmaps component:
>>>>
>>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt
>>>>
>>>>
>>>> I get this error:
>>>>
>>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for nodes
>>>> --------------------------------------------------------------------------
>>>> Your job failed to map. Either no mapper was available, or none
>>>> of the available mappers was able to perform the requested
>>>> mapping operation. This can happen if you request a map type
>>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>> --------------------------------------------------------------------------
>>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App. Process state updated for process NULL
>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED
>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with status 1
>>>>
>>>>
>>>> Is there a flag that I'm not turning on, or a component that I should
>>>> have selected?
>>>>
>>>> Thanks again.
>>>>
>>>> Hugo Meyer
>>>>
>>>>
>>>> 2011/3/26 Hugo Meyer <[email protected]>
>>>>
>>>>> Ok Ralph.
>>>>>
>>>>> Thanks a lot for your help, I will do as you said and then let you know
>>>>> how it goes.
>>>>>
>>>>> Best Regards.
>>>>>
>>>>> Hugo Meyer
>>>>>
>>>>>
>>>>> 2011/3/25 Ralph Castain <[email protected]>
>>>>>
>>>>>>
>>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>>>
>>>>>>> From what you've described before, I suspect all you'll need to do is
>>>>>>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks
>>>>>>> to see if a process in the launch message is being relocated (the
>>>>>>> construct_child_list code does that already), and then (b) sends the
>>>>>>> required info to all local child processes so they can take appropriate
>>>>>>> action.
>>>>>>>
>>>>>>> Failure detection, re-launch, etc. have all been taken care of for you.
>>>>>>>
>>>>>>
>>>>>>
>>>>>> I looked at the code that you mentioned and I realized that I have
>>>>>> two possible options, which I'm going to share with you to get your
>>>>>> opinion.
>>>>>>
>>>>>> First of all, I will describe the current state of the
>>>>>> implementation. I'm working on a fault-tolerant system that uses
>>>>>> uncoordinated checkpoints, so I'm taking checkpoints of all my processes at
>>>>>> different times and storing them on the machine where they reside, but
>>>>>> I also send these checkpoints to another node (let's call it the protector),
>>>>>> so that if a node fails, its processes can be restarted on the protector that
>>>>>> has their checkpoints.
>>>>>>
>>>>>> Right now I'm detecting the failure of a process and I know where this
>>>>>> process should be restarted, and I also have the checkpoint on the
>>>>>> protector. And I also have the child information, of course.
>>>>>>
>>>>>> So, my options are:
>>>>>> *First Option*
>>>>>>
>>>>>> I detect the failure, and then I use
>>>>>> orte_errmgr_hnp_base_global_update_state() with some modifications and
>>>>>> hnp_relocate, but changing the spawning to make a restart from a checkpoint.
>>>>>> I suppose that, using this, the migration of the process to another node will
>>>>>> be updated and everyone will know about it, because it is the HNP that is
>>>>>> going to do this (is this ok?).
>>>>>>
>>>>>>
>>>>>> This is the option I would use. The other one is much, much more work.
>>>>>> In this option, you only have to:
>>>>>>
>>>>>> (a) modify the mapper so you can specify the location of the proc
>>>>>> being restarted. The resilient mapper module will be handling the restart -
>>>>>> if you look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the
>>>>>> code doing the "replacement" and modify accordingly.
>>>>>>
>>>>>> (b) add any required info about your checkpoint to the launch message.
>>>>>> This gets created in orte/mca/odls/base/odls_base_default_fns.c, the
>>>>>> "get_add_procs_data" function (at the top of the file).
>>>>>>
>>>>>> (c) modify the launch code to handle your checkpoint, if required -
>>>>>> see the file in (b), the "construct_child" and "launch" functions.
>>>>>>
>>>>>> HTH
>>>>>> Ralph
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Second Option*
>>>>>>
>>>>>> Modify one of the spawn variations (probably the remote_spawn from
>>>>>> rsh) in the PLM framework and then use the orted_comm to command a
>>>>>> remote_spawn on the protector, but here I don't know how to update the info
>>>>>> so everyone knows about the change, or how this is managed.
>>>>>>
>>>>>> I might be very wrong in what I said, my apologies if so.
>>>>>>
>>>>>> Thanks a lot for all the help.
>>>>>>
>>>>>> Best regards.
>>>>>>
>>>>>> Hugo Meyer
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> [email protected]
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
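
On the port question at the top of this message, here is a minimal sketch of the "restart on another port" idea. It assumes plain POSIX sockets and is not taken from Open MPI's oob_tcp code; the helper name open_listen_socket is hypothetical. Requesting port 0 lets the kernel pick any free port, and getsockname() reports which one was assigned. The sketch does not cover how the other daemons and processes would learn the new port.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not an Open MPI function): open an IPv4 TCP listen
 * socket on requested_port. Passing 0 asks the kernel to choose a free
 * ephemeral port. On success, returns the socket descriptor and stores the
 * port actually bound (host byte order) in *bound_port. Returns -1 on error.
 */
int open_listen_socket(unsigned short requested_port, unsigned short *bound_port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int one = 1;
    int fd;

    assert(bound_port != NULL);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    /* Allow quick rebinding of a port left in TIME_WAIT by a dead process. */
    (void)setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(requested_port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }

    /* Ask the kernel which port was really assigned (matters when 0 was requested). */
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }

    *bound_port = ntohs(addr.sin_port);
    return fd;
}
```

A restarted process could call open_listen_socket(0, &port) instead of reusing the port saved in its checkpoint, and would then need some way to publish the new port to its peers.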
