OK Ralph. Thanks a lot, I will resend this message with a new subject. Best Regards.
Hugo

2011/3/31 Ralph Castain <r...@open-mpi.org>

> Sorry - should have included the devel list when I sent this.
>
>
> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>
> I'm not the expert on this area - Josh is, so I'll defer to him. I did take
> a quick glance at the sstore framework, though, and it looks like there are
> some params you could set that might help.
>
> "ompi_info --param sstore all"
>
> should tell you what's available. Also, note that Josh created a man page
> to explain how sstore works. It's in section 7, looks like "man orte_sstore"
> should get it.
>
>
> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>
> Hello again.
>
> I'm working on the launch code to handle my checkpoints, but I'm a little
> stuck on how to set the path to my checkpoint and the executable
> (ompi_blcr_context.PID). I took a look at the code in
> odls_base_default_fns.c and this piece of code caught my attention:
>
> #if OPAL_ENABLE_FT_CR == 1
>             /*
>              * OPAL CRS components need the opportunity to take action before a process
>              * is forked.
>              * Needs access to:
>              *   - Environment
>              *   - Rank/ORTE Name
>              *   - Binary to exec
>              */
>             if( NULL != opal_crs.crs_prelaunch ) {
>                 if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>                                          orte_sstore_base_prelaunch_location,
>                                          &(app->app),
>                                          &(app->cwd),
>                                          &(app->argv),
>                                          &(app->env) ) ) ) {
>                     ORTE_ERROR_LOG(rc);
>                     goto CLEANUP;
>                 }
>             }
> #endif
>
>
> But I didn't find out how to set orte_sstore_base_prelaunch_location; I know
> that initially this is set in sstore_base_open. For example, as I'm
> transferring my checkpoint from one node to another, I store the checkpoint
> that has to be restored in /tmp/1/ and it has a name
> like ompi_blcr_context.PID.
>
> Is there any function that I didn't see that allows me to do this? I'm
> asking this because I do not want to change the signature of the functions
> to pass the details of the checkpoint and the PID.
>
> Best Regards.
>
> Hugo Meyer
>
> 2011/3/30 Hugo Meyer <meyer.h...@gmail.com>
>
>> Thanks Ralph.
>> I have finished point (a), and now it's working. Now I have to work on
>> relaunching from my checkpoint as you said.
>>
>> Best regards.
>>
>> Hugo Meyer
>>
>>
>> 2011/3/29 Ralph Castain <r...@open-mpi.org>
>>
>>> The resilient mapper -only- works on procs being restarted - it cannot
>>> map a job for its initial launch. You shouldn't set any rmaps flag and
>>> things will work correctly - the default round-robin mapper will map the
>>> initial launch, and then the resilient mapper will handle restarts.
>>>
>>>
>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>
>>> Ralph.
>>>
>>> I'm having a problem when I try to select the resilient rmaps component:
>>>
>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
>>> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
>>> rsh -mca routed cm ./coll 6 10 2>out.txt
>>>
>>>
>>> I get this error:
>>>
>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
>>> nodes
>>>
>>> --------------------------------------------------------------------------
>>> Your job failed to map. Either no mapper was available, or none
>>> of the available mappers was able to perform the requested
>>> mapping operation. This can happen if you request a map type
>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>
>>>
>>> --------------------------------------------------------------------------
>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App.
>>> Process state updated for process NULL
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>> NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>> NEVER LAUNCHED
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0]
>>> with status 1
>>>
>>>
>>> Is there a flag that I'm not turning on? Or a component that I should
>>> have selected?
>>>
>>> Thanks again.
>>>
>>> Hugo Meyer
>>>
>>>
>>> 2011/3/26 Hugo Meyer <meyer.h...@gmail.com>
>>>
>>>> Ok Ralph.
>>>>
>>>> Thanks a lot for your help, I will do as you said and then let you know
>>>> how it goes.
>>>>
>>>> Best Regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>>
>>>> 2011/3/25 Ralph Castain <r...@open-mpi.org>
>>>>
>>>>>
>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>>
>>>>>> From what you've described before, I suspect all you'll need to do is
>>>>>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks
>>>>>> to see if a process in the launch message is being relocated (the
>>>>>> construct_child_list code does that already), and then (b) sends the
>>>>>> required info to all local child processes so they can take appropriate
>>>>>> action.
>>>>>>
>>>>>> Failure detection, re-launch, etc. have all been taken care of for
>>>>>> you.
>>>>>>
>>>>>
>>>>>
>>>>> I looked at the code that you mentioned and I realized that I have
>>>>> two possible options, which I'm going to share with you to get your
>>>>> opinion.
>>>>>
>>>>> First of all I will let you know my current situation with the
>>>>> implementation.
>>>>> As I'm working on a fault-tolerant system that uses uncoordinated
>>>>> checkpointing, I take checkpoints of all my processes at different
>>>>> times and store them on the machine where they reside, but I also
>>>>> send these checkpoints to another node (let's call it the protector),
>>>>> so if a node fails, its processes should be restarted on the protector
>>>>> that has their checkpoints.
>>>>>
>>>>> Right now I'm detecting the failure of a process and I know where this
>>>>> process should be restarted, and I also have the checkpoint on the
>>>>> protector. And I also have the child information, of course.
>>>>>
>>>>> So, my options are:
>>>>> *First Option*
>>>>>
>>>>> I detect the failure, and then I use
>>>>> orte_errmgr_hnp_base_global_update_state() with some modifications and
>>>>> hnp_relocate, but changing the spawning to make a restart from a
>>>>> checkpoint. I suppose that by using this, the migration of the process
>>>>> to another node will be updated and everyone will know about it, because
>>>>> it is the HNP that is going to do this (is this OK?).
>>>>>
>>>>>
>>>>> This is the option I would use. The other one is much, much more work.
>>>>> In this option, you only have to:
>>>>>
>>>>> (a) modify the mapper so you can specify the location of the proc being
>>>>> restarted. The resilient mapper module will be handling the restart - if you
>>>>> look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the code
>>>>> doing the "replacement" and modify accordingly.
>>>>>
>>>>> (b) add any required info about your checkpoint to the launch message.
>>>>> This gets created in orte/mca/odls/base/odls_base_default_fns.c, the
>>>>> "get_add_procs_data" function (at the top of the file).
>>>>>
>>>>> (c) modify the launch code to handle your checkpoint, if required - see
>>>>> the file in (b), the "construct_child" and "launch" functions.
>>>>>
>>>>> HTH
>>>>> Ralph
>>>>>
>>>>>
>>>>>
>>>>> *Second Option*
>>>>>
>>>>> Modify one of the spawn variations (probably the remote_spawn from rsh) in
>>>>> the PLM framework and then use orted_comm to command a remote_spawn on
>>>>> the protector, but here I don't know how to update the info so everyone
>>>>> knows about the change, or how this is managed.
>>>>>
>>>>> I might be very wrong in what I said, my apologies if so.
>>>>>
>>>>> Thanks a lot for all the help.
>>>>>
>>>>> Best regards.
>>>>>
>>>>> Hugo Meyer
>>>>>
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
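
On the checkpoint-location question in this thread: the checkpoint is described as living in a directory such as /tmp/1/ under a name like ompi_blcr_context.PID. A minimal sketch in plain C of composing that path from a directory and a PID might look like the following. The helper name build_ckpt_path and the CKPT_PREFIX macro are hypothetical and are not part of Open MPI; how the resulting path would be handed to the sstore or CRS frameworks is not shown here and would depend on the params reported by "ompi_info --param sstore all".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* File-name prefix as described in the thread ("ompi_blcr_context.PID"). */
#define CKPT_PREFIX "ompi_blcr_context"

/*
 * Hypothetical helper (not an Open MPI function): writes
 * "<dir>/ompi_blcr_context.<pid>" into buf.
 * Returns 0 on success, -1 if the result does not fit in len bytes.
 */
int build_ckpt_path(const char *dir, pid_t pid, char *buf, size_t len)
{
    assert(dir != NULL && buf != NULL);

    /* Add a '/' separator only when dir does not already end with one. */
    size_t dlen = strlen(dir);
    const char *sep = (dlen > 0 && dir[dlen - 1] == '/') ? "" : "/";

    int n = snprintf(buf, len, "%s%s%s.%ld", dir, sep, CKPT_PREFIX, (long)pid);
    if (n < 0 || (size_t)n >= len) {
        return -1; /* encoding error or truncated output */
    }
    return 0;
}
```

Calling build_ckpt_path("/tmp/1/", 4242, buf, sizeof buf) fills buf with "/tmp/1/ompi_blcr_context.4242". Checking the snprintf return value against the buffer size avoids silently using a truncated path when restarting.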