Re: [OMPI devel] Add child to another parent.

2011-03-31 Thread Hugo Meyer
Ok Ralph.
Thanks a lot, I will resend this message with a new subject.

Best Regards.

Hugo

2011/3/31 Ralph Castain 

> Sorry - should have included the devel list when I sent this.
>
>
> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>
> I'm not the expert on this area - Josh is, so I'll defer to him. I did take
> a quick glance at the sstore framework, though, and it looks like there are
> some params you could set that might help.
>
> "ompi_info --param sstore all"
>
> should tell you what's available. Also, note that Josh created a man page
> to explain how sstore works. It's in section 7, looks like "man orte_sstore"
> should get it.
>
>
> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>
> Hello again.
>
> I'm working on the launch code to handle my checkpoints, but I'm a little
> stuck on how to set the path to my checkpoint and the executable
> (ompi_blcr_context.PID). I took a look at the code in
> odls_base_default_fns.c and this piece of code caught my attention:
>
> #if OPAL_ENABLE_FT_CR == 1
>     /*
>      * OPAL CRS components need the opportunity to take action before a
>      * process is forked.
>      * Needs access to:
>      *   - Environment
>      *   - Rank/ORTE Name
>      *   - Binary to exec
>      */
>     if( NULL != opal_crs.crs_prelaunch ) {
>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>                                      orte_sstore_base_prelaunch_location,
>                                      &(app->app),
>                                      &(app->cwd),
>                                      &(app->argv),
>                                      &(app->env) ) ) ) {
>             ORTE_ERROR_LOG(rc);
>             goto CLEANUP;
>         }
>     }
> #endif
>
>
> But I didn't find out how to set orte_sstore_base_prelaunch_location; I know
> that initially it is set in sstore_base_open. For example, as I'm
> transferring my checkpoint from one node to another, I store the checkpoint
> that has to be restored in /tmp/1/ and it has a name
> like ompi_blcr_context.PID.
>
> Is there any function that I didn't see that allows me to do this? I'm
> asking this because I do not want to change the signature of the functions
> to pass the details of the checkpoint and the PID.
>
> Best Regards.
>
> Hugo Meyer
>
> 2011/3/30 Hugo Meyer 
>
>> Thanks Ralph.
>> I have finished point (a) and it is working now; next I have to work on
>> relaunching from my checkpoint, as you said.
>>
>> Best regards.
>>
>> Hugo Meyer
>>
>>
>> 2011/3/29 Ralph Castain 
>>
>>>  The resilient mapper -only- works on procs being restarted - it cannot
>>> map a job for its initial launch. You shouldn't set any rmaps flag and
>>> things will work correctly - the default round-robin mapper will map the
>>> initial launch, and then the resilient mapper will handle restarts.
>>>
>>>
>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>
>>> Ralph.
>>>
>>> I'm having a problem when I try to select the resilient rmaps component:
>>>
>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
>>> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
>>> rsh -mca routed cm ./coll 6 10 2>out.txt
>>>
>>>
>>> I get this error:
>>>
>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
>>> nodes
>>>
>>> --
>>> Your job failed to map. Either no mapper was available, or none
>>> of the available mappers was able to perform the requested
>>> mapping operation. This can happen if you request a map type
>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>
>>>
>>> --
>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) --- App.
>>> Process state updated for process NULL
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>> NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>> NEVER LAUNCHED
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0]
>>> with status 1
>>>
>>>
>>> Is there a flag that I'm not turning on, or a component that I should
>>> have selected?
>>>
>>> Thanks again.
>>>
>>> Hugo Meyer
>>>
>>>
>>> 2011/3/26 Hugo Meyer 
>>>
>>>> Ok Ralph.
>>>>
>>>> Thanks a lot for your help, I will do as you said and then let you know
>>>> how it goes.
>>>>
>>>> Best Regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>>
>>>> 2011/3/25 Ralph Castain
>>>>
>>>>>
>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>>
>>>>> From what you've described before, I suspect all you'll need to do is
>>>>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a)
>>>>> checks to see if a process in the launch message is being relocated (the
>>>>> construct_child_list code does that already), and then (b) sends the
>>>>> required info to all local child processes so they can take appropriate

[OMPI devel] Setting Checkpoint path and executables

2011-03-31 Thread Hugo Meyer
Hello again.

I'm working on the launch code to handle my checkpoints, but I'm a little
stuck on how to set the path to my checkpoint and the executable
(ompi_blcr_context.PID). I took a look at the code in
odls_base_default_fns.c and this piece of code caught my attention:

#if OPAL_ENABLE_FT_CR == 1
    /*
     * OPAL CRS components need the opportunity to take action before a
     * process is forked.
     * Needs access to:
     *   - Environment
     *   - Rank/ORTE Name
     *   - Binary to exec
     */
    if( NULL != opal_crs.crs_prelaunch ) {
        if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
                                     orte_sstore_base_prelaunch_location,
                                     &(app->app),
                                     &(app->cwd),
                                     &(app->argv),
                                     &(app->env) ) ) ) {
            ORTE_ERROR_LOG(rc);
            goto CLEANUP;
        }
    }
#endif

I've seen that I can set the *location* with an MCA parameter, but my case
is more involved: I'm working with uncoordinated checkpointing and also
passing checkpoints from one node to another, so that, for example, node 2
can restore a process that was residing on node 1. Every node therefore
keeps the checkpoint files of its own processes in one folder, and the
checkpoints of processes residing elsewhere in another folder.

Is there any way to set where the checkpoints are stored and their
executable names, taking my situation into account?
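
To make the layout concrete, here is a minimal sketch of how such a
per-folder checkpoint path might be assembled. It is only an illustration
under stated assumptions: build_ckpt_path, base_dir and subdir_id are
hypothetical names and not part of Open MPI, and the numeric folder is
assumed to identify where the checkpoint came from. The file name pattern
ompi_blcr_context.PID is the one used by the checkpoints described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not an Open MPI function): build the full path of
 * a BLCR context file as <base_dir>/<subdir_id>/ompi_blcr_context.<pid>.
 *
 * Assumption: each numeric sub-folder holds the checkpoints that belong
 * together (for example, all checkpoints received from one other node).
 *
 * Returns 0 on success, -1 if the path does not fit in the buffer.
 */
int build_ckpt_path(char *out, size_t len,
                    const char *base_dir, int subdir_id, long pid)
{
    int n;

    assert(NULL != out && NULL != base_dir && len > 0);

    n = snprintf(out, len, "%s/%d/ompi_blcr_context.%ld",
                 base_dir, subdir_id, pid);

    /* snprintf returns the length it needed; a negative value or a value
     * of at least len means an error or a truncated path. */
    if (n < 0 || (size_t)n >= len) {
        return -1;
    }
    return 0;
}
```

With base_dir "/tmp", subdir_id 1 and pid 12345 this would produce
/tmp/1/ompi_blcr_context.12345. What is still missing is a supported way
to hand such a per-process path to the prelaunch step without changing
the function signatures.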

Best Regards.

Hugo Meyer