As far as I know, what Josh did is slightly different. In the case of a 
complete restart (where all processes are restarted from a checkpoint), he 
sets up and rewires a new set of BTLs.
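
To make the shape of that concrete, here is a toy C sketch. All names are 
invented for illustration; this is not the actual BTL code (which lives under 
ompi/mca/btl). The idea is just: on a complete restart, nothing from before 
the checkpoint survives, so you discard all endpoint state and re-wire from 
freshly exchanged contact info.

    /* Hypothetical sketch -- not Open MPI code.  Shows the general shape of
     * a complete restart: throw away all pre-checkpoint endpoint state, then
     * set up and re-wire a new set of transport endpoints from fresh
     * contact info. */
    #include <stdio.h>
    #include <string.h>

    #define NPEERS 4

    typedef struct {            /* stand-in for per-peer BTL endpoint state */
        char contact[64];       /* e.g. "host:port" published after restart */
        int  connected;
    } endpoint_t;

    static endpoint_t ep[NPEERS];

    static void teardown_old_endpoints(void)
    {
        memset(ep, 0, sizeof(ep));        /* nothing from before survives */
    }

    static void rewire_endpoint(int rank, const char *contact)
    {
        snprintf(ep[rank].contact, sizeof(ep[rank].contact), "%s", contact);
        ep[rank].connected = 1;      /* real code would open the link here */
    }

    int main(void)
    {
        /* Pretend these arrived via an out-of-band (ORTE-like) exchange. */
        const char *fresh[NPEERS] =
            { "nodeA:5000", "nodeB:5001", "nodeC:5002", "nodeD:5003" };
        int r;

        teardown_old_endpoints();
        for (r = 0; r < NPEERS; r++)
            rewire_endpoint(r, fresh[r]);
        for (r = 0; r < NPEERS; r++)
            printf("rank %d re-wired at %s\n", r, ep[r].contact);
        return 0;
    }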

However, as it happens, we do have some code to rewire the MPI processes in 
case of failure(s) in one of the UTK projects. I'll have to talk with the 
team here to see whether, at this point, there is something we can contribute 
on this matter.
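
The failure case is the harder of the two problems Ralph raises below -- 
(a) restart the proc on a new node, and (b) update the BTL connection info 
on the surviving procs. As a rough illustration of (b), here is a minimal 
hypothetical sketch (again, invented names, not the UTK code): only the 
moved proc's endpoint changes, and the survivors must swap it in without 
tearing everything else down.

    /* Hypothetical sketch -- invented names, not the actual recovery code.
     * Unlike a full restart, only the relocated proc's endpoint changes. */
    #include <stdio.h>

    #define NPEERS 4

    typedef struct {
        char contact[64];
        int  connected;
    } peer_t;

    static peer_t peers[NPEERS];

    /* Called when the runtime reports that `rank` was restarted on a new
     * node.  Real code would also have to quiesce in-flight traffic and
     * retransmit (or drop) anything queued for the stale endpoint. */
    static void on_peer_relocated(int rank, const char *new_contact)
    {
        peers[rank].connected = 0;                    /* drop stale link */
        snprintf(peers[rank].contact,
                 sizeof(peers[rank].contact), "%s", new_contact);
        peers[rank].connected = 1;                    /* reconnect */
        printf("rank %d now reachable at %s\n", rank, peers[rank].contact);
    }

    int main(void)
    {
        int r;
        for (r = 0; r < NPEERS; r++) {                /* initial wiring */
            snprintf(peers[r].contact, sizeof(peers[r].contact),
                     "node%c:%d", 'A' + r, 5000 + r);
            peers[r].connected = 1;
        }
        /* Simulate: the node hosting rank 2 dies; rank 2 restarts on
         * another node. */
        on_peer_relocated(2, "nodeE:6002");
        return 0;
    }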

  george.

On Dec 15, 2009, at 21:08 , Ralph Castain wrote:

> 
> On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:
> 
>> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>> 
>>> It probably should be done at a lower level, but it raises a different 
>>> question. For example, I've created the capability in the new cluster 
>>> manager to detect interfaces that are lost, ride through the problem by 
>>> moving affected procs to other nodes (reconnecting ORTE-level comm), and 
>>> move procs back if/when nodes reappear. So someone can remove a node 
>>> "on-the-fly" and replace that hardware with another node without having to 
>>> stop and restart the job, etc. A lot of that infrastructure is now down 
>>> inside ORTE, though a few key pieces remain in the ORCM code base (and most 
>>> likely will stay there).
>>> 
>>> Works great - unless it is an MPI job. If we can figure out a way for the 
>>> MPI procs to (a) be properly restarted on the "new" node, and (b) update 
>>> the BTL connection info on the other MPI procs in the job, then we would be 
>>> good to go...
>>> 
>>> Trivial problem, I am sure :-)
>> 
>> ...actually, the groundwork is there with Josh's work, isn't it?  I think 
>> the real issue is handling ungraceful BTL failures properly.  I'm guessing 
>> that's the biggest piece that isn't done...?
> 
> Think so... not sure how to update the BTLs with the new info, but perhaps 
> Josh has already solved that problem.
> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> 
>> 

