Ok Ralph.
Thanks a lot, I will resend this message with a new subject.
Best Regards.
Hugo
2011/3/31 Ralph Castain
> Sorry - should have included the devel list when I sent this.
>
>
> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>
> I'm not the expert on this area - Josh is, so I'll defer to him. I did take
> a quick glance at the sstore framework, though, and it looks like there are
> some params you could set that might help.
>
> "ompi_info --param sstore all"
>
> should tell you what's available. Also, note that Josh created a man page
> to explain how sstore works. It's in section 7, looks like "man orte_sstore"
> should get it.
>
>
> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>
> Hello again.
>
> I'm working on the launch code to handle my checkpoints, but I'm a little
> stuck on how to set the path to my checkpoint and the executable
> (ompi_blcr_context.PID). I took a look at the code in
> odls_base_default_fns.c and this piece of code caught my attention:
>
> #if OPAL_ENABLE_FT_CR == 1
>     /*
>      * OPAL CRS components need the opportunity to take action before a process
>      * is forked.
>      * Needs access to:
>      *   - Environment
>      *   - Rank/ORTE Name
>      *   - Binary to exec
>      */
>     if( NULL != opal_crs.crs_prelaunch ) {
>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>                                                          orte_sstore_base_prelaunch_location,
>                                                          &(app->app),
>                                                          &(app->cwd),
>                                                          &(app->argv),
>                                                          &(app->env) ) ) ) {
>             ORTE_ERROR_LOG(rc);
>             goto CLEANUP;
>         }
>     }
> #endif
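
For readers who don't know this idiom, the quoted snippet follows a common "optional hook" shape: a module struct carries a function pointer that may be NULL, the caller invokes it only when it is set, and any failure jumps to a single cleanup label. The sketch below is a toy illustration of that shape only. Every name in it (toy_crs_module_t, toy_prelaunch_fn, toy_hook_swap_binary, toy_launch, the TOY_* codes) is hypothetical and is not part of OPAL or ORTE, and the hook signature is deliberately simplified.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for status codes; NOT the real OPAL/ORTE values. */
#define TOY_SUCCESS 0
#define TOY_ERROR  (-1)

/* Hypothetical, simplified hook signature: the hook may rewrite *app. */
typedef int (*toy_prelaunch_fn)(int rank, const char *snapshot_dir,
                                const char **app);

/* Hypothetical module struct: prelaunch is NULL when no hook is provided. */
typedef struct {
    toy_prelaunch_fn prelaunch;
} toy_crs_module_t;

/* Example hook (an assumption for illustration): swap the binary to exec. */
static int toy_hook_swap_binary(int rank, const char *snapshot_dir,
                                const char **app)
{
    (void)rank;
    if (NULL == snapshot_dir) {
        return TOY_ERROR;            /* nothing to restart from */
    }
    *app = "restart-wrapper";        /* placeholder name, not a real tool */
    return TOY_SUCCESS;
}

/* Same control flow as the quoted snippet: call the hook only if it exists,
 * and leave through one cleanup label on failure. */
static int toy_launch(const toy_crs_module_t *crs, int rank,
                      const char *snapshot_dir, const char **app)
{
    int rc = TOY_SUCCESS;

    assert(NULL != crs && NULL != app);

    if (NULL != crs->prelaunch) {
        if (TOY_SUCCESS != (rc = crs->prelaunch(rank, snapshot_dir, app))) {
            goto CLEANUP;
        }
    }
    /* A real launcher would fork/exec *app at this point. */

CLEANUP:
    return rc;
}
```

With a module whose prelaunch pointer is NULL, toy_launch leaves the binary name untouched; with toy_hook_swap_binary installed, the name is replaced before the (omitted) fork, and a NULL snapshot directory makes the launch fail through the cleanup label.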
>
>
> But I couldn't find out how to set orte_sstore_base_prelaunch_location; I know
> that it is initially set in sstore_base_open. For example, as I'm
> transferring my checkpoint from one node to another, I store the checkpoint
> that has to be restored in /tmp/1/ and it has a name
> like ompi_blcr_context.PID.
>
> Is there any function that I didn't see that allows me to do this? I'm
> asking this because I do not want to change the signature of the functions
> to pass the details of the checkpoint and the PID.
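
To make the naming scheme in the question concrete, here is a minimal sketch of how a path such as /tmp/1/ompi_blcr_context.PID could be assembled from a directory and a PID. The helper name build_context_path is hypothetical and does not exist in Open MPI. The sketch also does not show how orte_sstore_base_prelaunch_location gets set, which is the open question in this message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not an Open MPI function): writes
 * "<dir>/ompi_blcr_context.<pid>" into buf.
 * Returns 0 on success, -1 if the result does not fit in len bytes. */
static int build_context_path(char *buf, size_t len,
                              const char *dir, long pid)
{
    size_t dlen;
    const char *sep;
    int n;

    assert(NULL != buf && NULL != dir);

    /* Avoid a doubled slash when dir already ends in '/', as "/tmp/1/" does. */
    dlen = strlen(dir);
    sep = (dlen > 0 && '/' == dir[dlen - 1]) ? "" : "/";

    n = snprintf(buf, len, "%s%sompi_blcr_context.%ld", dir, sep, pid);
    if (n < 0 || (size_t)n >= len) {
        return -1;                   /* encoding error or truncated output */
    }
    return 0;
}
```

Called with "/tmp/1/" and a PID of 1234, the helper fills the buffer with /tmp/1/ompi_blcr_context.1234. It reports truncation rather than silently cutting the path short.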
>
> Best Regards.
>
> Hugo Meyer
>
> 2011/3/30 Hugo Meyer
>
>> Thanks Ralph.
>> I have finished point (a), and it's working now. Next I have to work on
>> relaunching from my checkpoint, as you said.
>>
>> Best regards.
>>
>> Hugo Meyer
>>
>>
>> 2011/3/29 Ralph Castain
>>
>>> The resilient mapper -only- works on procs being restarted - it cannot
>>> map a job for its initial launch. You shouldn't set any rmaps flag and
>>> things will work correctly - the default round-robin mapper will map the
>>> initial launch, and then the resilient mapper will handle restarts.
>>>
>>>
>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>
>>> Ralph.
>>>
>>> I'm having a problem when I try to select the resilient rmaps component:
>>>
>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
>>> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
>>> rsh -mca routed cm ./coll 6 10 2>out.txt
>>>
>>>
>>> I get this error:
>>>
>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
>>> nodes
>>>
>>> --
>>> Your job failed to map. Either no mapper was available, or none
>>> of the available mappers was able to perform the requested
>>> mapping operation. This can happen if you request a map type
>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>
>>>
>>> --
>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) --- App.
>>> Process state updated for process NULL
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>> NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state
>>> NEVER LAUNCHED
>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0]
>>> with status 1
>>>
>>>
>>> Is there a flag that I'm not turning on, or a component that I should
>>> have selected?
>>>
>>> Thanks again.
>>>
>>> Hugo Meyer
>>>
>>>
>>> 2011/3/26 Hugo Meyer
>>>
Ok Ralph.
Thanks a lot for your help, I will do as you said and then let you know
how it goes.
Best Regards.
Hugo Meyer
2011/3/25 Ralph Castain
>
> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>
> From what you've described before, I suspect all you'll need to do is
>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks
>> to see if a process in the launch message is being relocated (the
>> construct_child_list code does that already), and then (b) sends the
>> required info to all local child processes so they can take appropriate