On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:

> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
> 
>> It probably should be done at a lower level, but it raises a different 
>> question. For example, I've created the capability in the new cluster 
>> manager to detect interfaces that are lost, ride through the problem by 
>> moving affected procs to other nodes (reconnecting ORTE-level comm), and 
>> move procs back if/when nodes reappear. So someone can remove a node 
>> "on-the-fly" and replace that hardware with another node without having to 
>> stop and restart the job, etc. A lot of that infrastructure is now down 
>> inside ORTE, though a few key pieces remain in the ORCM code base (and most 
>> likely will stay there).
>> 
>> Works great - unless it is an MPI job. If we can figure out a way for the 
>> MPI procs to (a) be properly restarted on the "new" node, and (b) update the 
>> BTL connection info on the other MPI procs in the job, then we would be good 
>> to go...
>> 
>> Trivial problem, I am sure :-)
> 
> ...actually, the groundwork is there with Josh's work, isn't it?  I think the 
> real issue is handling un-graceful BTL failures properly.  I'm guessing 
> that's the biggest piece that isn't done...?

I think so... I'm not sure how to update the BTLs with the new info, but perhaps 
Josh has already solved that problem.

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel