Re: [OMPI devel] Add child to another parent.

Ralph Castain Wed, 30 Mar 2011 20:13:13 -0400

Sorry - should have included the devel list when I sent this.


On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:

> I'm not the expert on this area - Josh is, so I'll defer to him. I did take a 
> quick glance at the sstore framework, though, and it looks like there are 
> some params you could set that might help.
> 
> "ompi_info --param sstore all"
> 
> should tell you what's available. Also, note that Josh created a man page to 
> explain how sstore works. It's in section 7, looks like "man orte_sstore" 
> should get it.
> 
> 
> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
> 
>> Hello again.
>> 
>> I'm working on the launch code to handle my checkpoints, but I'm a little 
>> stuck on how to set the path to my checkpoint and the executable 
>> (ompi_blcr_context.PID). I took a look at the code in 
>> odls_base_default_fns.c and this piece of code caught my attention:
>> 
>> #if OPAL_ENABLE_FT_CR == 1
>>             /*
>>              * OPAL CRS components need the opportunity to take action before a process
>>              * is forked.
>>              * Needs access to:
>>              *   - Environment
>>              *   - Rank/ORTE Name
>>              *   - Binary to exec
>>              */
>>             if( NULL != opal_crs.crs_prelaunch ) {
>>                 if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>                                                                  orte_sstore_base_prelaunch_location,
>>                                                                  &(app->app),
>>                                                                  &(app->cwd),
>>                                                                  &(app->argv),
>>                                                                  &(app->env) ) ) ) {
>>                     ORTE_ERROR_LOG(rc);
>>                     goto CLEANUP;
>>                 }
>>             }
>> #endif
>> 
>> But I couldn't find out how to set orte_sstore_base_prelaunch_location; I know 
>> that it is initially set in sstore_base_open. For example, since I'm 
>> transferring my checkpoint from one node to another, I store the checkpoint 
>> that has to be restored in /tmp/1/, and it has a name like 
>> ompi_blcr_context.PID.
>> 
>> Is there a function I have missed that allows me to do this? I'm 
>> asking because I do not want to change the signatures of the functions 
>> to pass the details of the checkpoint and the PID.
>> 
>> Best Regards.
>> 
>> Hugo Meyer
>> 
>> 2011/3/30 Hugo Meyer <[email protected]>
>> Thanks Ralph.
>> I have finished point (a), and it is working now. Next I have to work on 
>> relaunching from my checkpoint, as you said.
>> 
>> Best regards.
>> 
>> Hugo Meyer
>> 
>> 
>> 2011/3/29 Ralph Castain <[email protected]>
>> The resilient mapper -only- works on procs being restarted - it cannot map a 
>> job for its initial launch. You shouldn't set any rmaps flag and things will 
>> work correctly - the default round-robin mapper will map the initial launch, 
>> and then the resilient mapper will handle restarts.
>> 
>> 
>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>> 
>>> Ralph.
>>> 
>>> I'm having a problem when I try to select the resilient rmaps component:
>>> 
>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt
>>> 
>>> I get this error:
>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for nodes
>>> --------------------------------------------------------------------------
>>> Your job failed to map. Either no mapper was available, or none
>>> of the available mappers was able to perform the requested
>>> mapping operation. This can happen if you request a map type
>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>> 
>>> --------------------------------------------------------------------------
>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App. Process state updated for process NULL
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with status 1
>>> 
>>> Is there a flag that I'm not turning on, or a component that I should have 
>>> selected?
>>> 
>>> Thanks again.
>>> 
>>> Hugo Meyer
>>> 
>>> 
>>> 2011/3/26 Hugo Meyer <[email protected]>
>>> Ok Ralph.
>>> 
>>> Thanks a lot for your help. I will do as you said and then let you know how 
>>> it goes.
>>> 
>>> Best Regards.
>>> 
>>> Hugo Meyer
>>> 
>>> 
>>> 2011/3/25 Ralph Castain <[email protected]>
>>> 
>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>> 
>>>> From what you've described before, I suspect all you'll need to do is add 
>>>> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to 
>>>> see if a process in the launch message is being relocated (the 
>>>> construct_child_list code does that already), and then (b) sends the 
>>>> required info to all local child processes so they can take appropriate 
>>>> action.
>>>> 
>>>> Failure detection, re-launch, etc. have all been taken care of for you.
>>>> 
>>>> 
>>>> I looked at the code that you mentioned and realized that I have two 
>>>> possible options, which I'd like to share with you to get your opinion.
>>>> 
>>>> First of all, let me describe the current state of my 
>>>> implementation. I'm working on a fault-tolerant system that uses 
>>>> uncoordinated checkpointing, so I take checkpoints of all my processes at 
>>>> different times and store them on the machine where they reside. 
>>>> I also send these checkpoints to another node (let's call it the protector), 
>>>> so that if a node fails, its processes can be restarted on the protector 
>>>> that holds their checkpoints.
>>>> 
>>>> Right now I'm detecting the failure of a process, I know where this 
>>>> process should be restarted, and I have the checkpoint on the 
>>>> protector. I also have the child information, of course.
>>>> 
>>>> So, my options are:
>>>> First Option
>>>> 
>>>> I detect the failure, and then I use 
>>>> orte_errmgr_hnp_base_global_update_state() with some modifications and 
>>>> hnp_relocate, but change the spawn to do a restart from a 
>>>> checkpoint. I suppose that this way the migration of the process to 
>>>> another node will be recorded and everyone will know about it, because it is 
>>>> the HNP that is going to do this (is this OK?).
>>> 
>>> This is the option I would use. The other one is much, much more work. In 
>>> this option, you only have to:
>>> 
>>> (a) modify the mapper so you can specify the location of the proc being 
>>> restarted. The resilient mapper module will be handling the restart - if 
>>> you look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the 
>>> code doing the "replacement" and modify accordingly.
>>> 
>>> (b) add any required info about your checkpoint to the launch message. This 
>>> gets created in orte/mca/odls/base/odls_base_default_fns.c, the 
>>> "get_add_procs_data" function (at the top of the file).
>>> 
>>> (c) modify the launch code to handle your checkpoint, if required - see the 
>>> file in (b), the "construct_child" and "launch" functions.
>>> 
>>> HTH
>>> Ralph
>>> 
>>> 
>>>> 
>>>> Second Option
>>>> 
>>>> Modify one of the spawn variations (probably remote_spawn from rsh) in 
>>>> the PLM framework and then use orted_comm to command a remote_spawn on 
>>>> the protector, but here I don't know how to update the info so that everyone 
>>>> knows about the change, or how this is managed.
>>>> 
>>>> I might be very wrong in what I said, my apologies if so.
>>>> 
>>>> Thanks a lot for all the help.
>>>> 
>>>> Best regards.
>>>> 
>>>> Hugo Meyer
>>>> 
>>>> _______________________________________________
>>>> devel mailing list
>>>> [email protected]
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>> 
>
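
On the checkpoint-path question quoted above: a minimal sketch of composing the
full path to a BLCR context file from a directory such as /tmp/1/ and a PID,
following the ompi_blcr_context.PID naming mentioned in the thread. The helper
build_ckpt_path is hypothetical and not part of Open MPI; it only illustrates
the string handling, not how orte_sstore_base_prelaunch_location is actually
set.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper (NOT an Open MPI function): writes
 * "<dir>/ompi_blcr_context.<pid>" into buf.
 * Returns 0 on success, -1 if the result does not fit in len bytes. */
int build_ckpt_path(char *buf, size_t len, const char *dir, pid_t pid)
{
    assert(buf != NULL && dir != NULL);

    /* Avoid a doubled slash when dir already ends with '/'. */
    size_t dlen = strlen(dir);
    const char *sep = (dlen > 0 && dir[dlen - 1] == '/') ? "" : "/";

    int n = snprintf(buf, len, "%s%sompi_blcr_context.%ld",
                     dir, sep, (long)pid);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    if (n < 0 || (size_t)n >= len) {
        return -1;
    }
    return 0;
}
```

Called with "/tmp/1/" and PID 12345, this fills the buffer with
/tmp/1/ompi_blcr_context.12345. Treating truncation as an error seems safer
than restarting from a silently shortened path.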

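On the "Your job failed to map" error quoted above: a toy model of why
selecting only the resilient mapper fails on the initial launch. All types and
names here are hypothetical and are not ORTE's; the model only assumes what the
thread states, namely that the resilient mapper handles restarts only and that
the round-robin mapper handles the initial launch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not ORTE data structures. */
typedef enum { JOB_INITIAL_LAUNCH, JOB_RESTART } job_kind_t;
typedef enum { MAPPER_ROUND_ROBIN, MAPPER_RESILIENT } mapper_t;

/* Returns 1 if the mapper takes the job, 0 if it declines.
 * In this model round-robin takes anything it is offered, while
 * resilient only takes procs that are being restarted. */
int mapper_accepts(mapper_t m, job_kind_t kind)
{
    switch (m) {
    case MAPPER_ROUND_ROBIN:
        return 1;
    case MAPPER_RESILIENT:
        return kind == JOB_RESTART;
    }
    return 0;
}

/* Offers the job to each available mapper in order.
 * Returns the index of the first mapper that accepts,
 * or -1 when none does (the "job failed to map" case). */
int select_mapper(const mapper_t *avail, size_t n, job_kind_t kind)
{
    assert(avail != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (mapper_accepts(avail[i], kind)) {
            return (int)i;
        }
    }
    return -1;
}
```

With only MAPPER_RESILIENT available, an initial launch returns -1. With both
mappers available and resilient offered first, the initial launch falls through
to round-robin and a restart is taken by resilient, which matches the advice
not to set any rmaps flag.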