On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:

> It probably should be done at a lower level, but it raises a different
> question. For example, I've created the capability in the new cluster
> manager to detect interfaces that are lost, ride through the problem by 
> moving affected procs to other nodes (reconnecting ORTE-level comm), and move 
> procs back if/when nodes reappear. So someone can remove a node "on-the-fly" 
> and replace that hardware with another node without having to stop and 
> restart the job, etc. A lot of that infrastructure is now down inside ORTE, 
> though a few key pieces remain in the ORCM code base (and most likely will 
> stay there).
> 
> Works great - unless it is an MPI job. If we can figure out a way for the MPI 
> procs to (a) be properly restarted on the "new" node, and (b) update the BTL 
> connection info on the other MPI procs in the job, then we would be good to 
> go...
> 
> Trivial problem, I am sure :-)

...actually, the groundwork is there with Josh's work, isn't it?  I think the 
real issue is handling ungraceful BTL failures properly.  I'm guessing that's 
the biggest piece that isn't done...?

-- 
Jeff Squyres
jsquy...@cisco.com

