Currently, I am working on process migration and automatic recovery based on 
checkpoint/restart. With respect to the PML stack, this works by rewiring the 
BTLs after the migrated/recovered MPI process(es) restart. There is a fair 
amount of work in getting this right at both the runtime and the OMPI layer 
(particularly the modex). For automatic recovery with C/R we will, at first, 
require restarting all processes in the job [for consistency]. For migration, 
only the processes being moved will need to be restarted; all others may 
simply be blocked.
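
Roughly, the rewiring amounts to the pattern in the toy C sketch below. This 
is not OMPI code: the "modex" lookup, node names, and endpoint structure are 
invented stand-ins, meant only to show the shape of the step where each 
restarted process republishes its contact information and every surviving 
peer drops its stale endpoint, pulls the refreshed entry, and reconnects.

    #include <stdio.h>

    /* Toy stand-in for a peer's BTL endpoint: which node it points at
     * and whether the connection is considered live. */
    typedef struct {
        int  peer_rank;
        char node[32];
        int  connected;
    } endpoint_t;

    /* Pretend "modex": maps rank -> current node.  In the real code this
     * is the module-exchange data republished by the restarted process.
     * Here, rank 1 is assumed to have migrated from "n01" to "n07". */
    static const char *modex_lookup(int rank)
    {
        return (rank == 1) ? "n07" : "n01";
    }

    /* What each surviving peer would do for a restarted proc: drop the
     * stale endpoint, pull the refreshed contact info, reconnect. */
    static void rewire(endpoint_t *ep)
    {
        ep->connected = 0;                             /* tear down stale BTL */
        snprintf(ep->node, sizeof(ep->node), "%s",
                 modex_lookup(ep->peer_rank));         /* refreshed contact   */
        ep->connected = 1;                             /* re-open connection  */
        printf("peer %d rewired to node %s\n", ep->peer_rank, ep->node);
    }

    int main(void)
    {
        endpoint_t ep = { .peer_rank = 1, .node = "n01", .connected = 1 };
        rewire(&ep);    /* peer learned that rank 1 restarted elsewhere */
        return 0;
    }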

I think what you are looking for is the ability to lose a process and replace 
it without restarting all of the remaining processes. That would require a 
bit more work beyond what I am currently doing, since you would need to flush 
the PML/BML/BTL stack of latent messages, among other things. The message 
logging work at UTK should handle this anyway (if they use uncoordinated C/R 
plus message logging), but they will have to fill in the details on that 
project.
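
For what it is worth, the sender-side logging idea can be illustrated with 
the toy sketch below. The structures and names are invented for illustration 
only (the real protocol also has to handle determinants, orphan messages, and 
so on): every send is recorded locally so it can be replayed to a peer that 
restarts from an older checkpoint, instead of rolling everyone back.

    #include <stdio.h>

    #define LOG_MAX 16

    typedef struct { int dest; char payload[32]; } logged_msg_t;

    static logged_msg_t msg_log[LOG_MAX];   /* sender-side message log */
    static int log_len = 0;

    /* On every send, also keep a copy so it can be replayed if the
     * receiver is lost and restarted from an older checkpoint. */
    static void logged_send(int dest, const char *payload)
    {
        if (log_len < LOG_MAX) {
            msg_log[log_len].dest = dest;
            snprintf(msg_log[log_len].payload,
                     sizeof(msg_log[log_len].payload), "%s", payload);
            log_len++;
        }
        printf("send to %d: %s\n", dest, payload);
    }

    /* After 'dest' restarts, replay the messages it may have missed,
     * so the surviving processes do not have to roll back. */
    static void replay_to(int dest)
    {
        for (int i = 0; i < log_len; i++) {
            if (msg_log[i].dest == dest) {
                printf("replay to %d: %s\n", dest, msg_log[i].payload);
            }
        }
    }

    int main(void)
    {
        logged_send(1, "iteration 41 halo");
        logged_send(2, "iteration 41 halo");
        replay_to(1);   /* rank 1 failed and was restarted elsewhere */
        return 0;
    }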

-- Josh

On Dec 16, 2009, at 1:32 AM, George Bosilca wrote:

> As far as I know, what Josh did is slightly different. In the case of a 
> complete restart (where all processes are restarted from a checkpoint), he 
> sets up and rewires a new set of BTLs.
> 
> However, it happens that we do have some code to rewire the MPI processes in 
> the case of failure(s) in one of UTK's projects. I'll have to talk with the 
> team here to see whether, at this point, there is something we can contribute 
> on this matter.
> 
>  george.
> 
> On Dec 15, 2009, at 21:08 , Ralph Castain wrote:
> 
>> 
>> On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:
>> 
>>> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>>> 
>>>> It probably should be done at a lower level, but it raises a different 
>>>> question. For example, I've created the capability in the new cluster 
>>>> manager to detect interfaces that are lost, ride through the problem by 
>>>> moving affected procs to other nodes (reconnecting ORTE-level comm), and 
>>>> move procs back if/when nodes reappear. So someone can remove a node 
>>>> "on-the-fly" and replace that hardware with another node without having to 
>>>> stop and restart the job, etc. A lot of that infrastructure is now down 
>>>> inside ORTE, though a few key pieces remain in the ORCM code base (and 
>>>> most likely will stay there).
>>>> 
>>>> Works great - unless it is an MPI job. If we can figure out a way for the 
>>>> MPI procs to (a) be properly restarted on the "new" node, and (b) update 
>>>> the BTL connection info on the other MPI procs in the job, then we would 
>>>> be good to go...
>>>> 
>>>> Trivial problem, I am sure :-)
>>> 
>>> ...actually, the groundwork is there with Josh's work, isn't it? I think 
>>> the real issue is handling ungraceful BTL failures properly. I'm guessing 
>>> that's the biggest piece that isn't done...?
>> 
>> Think so... not sure how to update the BTLs with the new info, but perhaps 
>> Josh has already solved that problem.
>> 
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> 
>> 
> 

