Thanks Ralph. I have finished point (a) and it is now working. Next I have to work on relaunching from my checkpoint, as you said.
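
For anyone following the thread, here is a minimal sketch of the initial launch as Ralph describes it in the quoted reply: the same command as in my earlier message, with only the "-mca rmaps resilient" selection removed so that the default round-robin mapper handles the first mapping. The path and all other arguments are simply copied from that earlier command; the LAUNCH_CMD variable is just an illustrative name, and the actual run is left commented out.

```shell
#!/bin/sh
# Sketch only: arguments copied from the command quoted later in this thread,
# minus "-mca rmaps resilient". No rmaps flag is set, so the default mapper
# maps the initial launch and the resilient mapper is left to handle restarts.
MPIRUN=/home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun
LAUNCH_CMD="$MPIRUN -v -np 4 --hostfile ../hostfile --bynode -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10"

# Print the command for inspection; uncomment the next line to actually run it.
echo "$LAUNCH_CMD"
# eval "$LAUNCH_CMD" 2>out.txt
```
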
Best regards.

Hugo Meyer

2011/3/29 Ralph Castain <r...@open-mpi.org>

> The resilient mapper -only- works on procs being restarted - it cannot map
> a job for its initial launch. You shouldn't set any rmaps flag and things
> will work correctly - the default round-robin mapper will map the initial
> launch, and then the resilient mapper will handle restarts.
>
>
> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>
> Ralph.
>
> I'm having a problem when I try to select the resilient rmaps component:
>
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
> rsh -mca routed cm ./coll 6 10 2>out.txt
>
>
> I get this error:
>
> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
> nodes
> --------------------------------------------------------------------------
> Your job failed to map. Either no mapper was available, or none
> of the available mappers was able to perform the requested
> mapping operation. This can happen if you request a map type
> (e.g., loadbalance) and the corresponding mapper was not built.
>
> --------------------------------------------------------------------------
> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App. Process
> state updated for process NULL
> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
> LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
> LAUNCHED
> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with
> status 1
>
>
> Is there a flag that I'm not turning on, or a component that I should have
> selected?
>
> Thanks again.
>
> Hugo Meyer
>
>
> 2011/3/26 Hugo Meyer <meyer.h...@gmail.com>
>
>> Ok Ralph.
>>
>> Thanks a lot for your help. I will do as you said and then let you know
>> how it goes.
>>
>> Best Regards.
>>
>> Hugo Meyer
>>
>>
>> 2011/3/25 Ralph Castain <r...@open-mpi.org>
>>
>>>
>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>
>>>> From what you've described before, I suspect all you'll need to do is add
>>>> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
>>>> see if a process in the launch message is being relocated (the
>>>> construct_child_list code does that already), and then (b) sends the
>>>> required info to all local child processes so they can take appropriate
>>>> action.
>>>>
>>>> Failure detection, re-launch, etc. have all been taken care of for you.
>>>>
>>>
>>>
>>> I looked at the code that you mentioned and I realize that I have two
>>> possible options, which I'm going to share with you to get your opinion.
>>>
>>> First of all, I will describe the current state of my implementation.
>>> I'm working on a fault-tolerant system that uses uncoordinated
>>> checkpointing, so I take checkpoints of all my processes at different
>>> times and store them on the machine where they reside. I also send
>>> these checkpoints to another node (let's call it the protector), so if
>>> a node fails, its processes should be restarted on the protector that
>>> has their checkpoints.
>>>
>>> Right now I'm detecting the failure of a process and I know where this
>>> process should be restarted, and I also have the checkpoint on the
>>> protector. And I also have the child information, of course.
>>>
>>> So, my options are:
>>>
>>> *First Option*
>>>
>>> I detect the failure and then use
>>> orte_errmgr_hnp_base_global_update_state() with some modifications,
>>> together with hnp_relocate, but changing the spawn so that it restarts
>>> from a checkpoint. I suppose that this way the migration of the process
>>> to another node will be recorded and everyone will know about it,
>>> because it is the HNP that does it (is this ok?).
>>>
>>>
>>> This is the option I would use. The other one is much, much more work. In
>>> this option, you only have to:
>>>
>>> (a) modify the mapper so you can specify the location of the proc being
>>> restarted. The resilient mapper module will be handling the restart - if you
>>> look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the code
>>> doing the "replacement" and modify accordingly.
>>>
>>> (b) add any required info about your checkpoint to the launch message.
>>> This gets created in orte/mca/odls/base/odls_base_default_fns.c, the
>>> "get_add_procs_data" function (at the top of the file).
>>>
>>> (c) modify the launch code to handle your checkpoint, if required - see
>>> the file in (b), the "construct_child" and "launch" functions.
>>>
>>> HTH
>>> Ralph
>>>
>>>
>>>
>>> *Second Option*
>>>
>>> Modify one of the spawn variations (probably the remote_spawn from rsh) in
>>> the PLM framework and then use the orted_comm to command a remote_spawn on
>>> the protector, but here I don't know how to update the info so that
>>> everyone knows about the change, or how this is managed.
>>>
>>> I might be very wrong in what I said; my apologies if so.
>>>
>>> Thanks a lot for all the help.
>>>
>>> Best regards.
>>>
>>> Hugo Meyer
>>>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel