On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:

> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>
>> It probably should be done at a lower level, but it begs a different
>> question. For example, I've created the capability in the new cluster
>> manager to detect interfaces that are lost, ride through the problem by
>> moving affected procs to other nodes (reconnecting ORTE-level comm), and
>> move procs back if/when nodes reappear. So someone can remove a node
>> "on-the-fly" and replace that hardware with another node without having to
>> stop and restart the job, etc. A lot of that infrastructure is now down
>> inside ORTE, though a few key pieces remain in the ORCM code base (and most
>> likely will stay there).
>>
>> Works great - unless it is an MPI job. If we can figure out a way for the
>> MPI procs to (a) be properly restarted on the "new" node, and (b) update the
>> BTL connection info on the other MPI procs in the job, then we would be good
>> to go...
>>
>> Trivial problem, I am sure :-)
>
> ...actually, the groundwork is there with Josh's work, isn't it? I think the
> real issue is handling un-graceful BTL failures properly. I'm guessing
> that's the biggest piece that isn't done...?
Think so... not sure how to update the BTLs with the new info, but perhaps Josh has already solved that problem.

> --
> Jeff Squyres
> jsquy...@cisco.com
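[Editor's note: for readers following the thread, here is a minimal sketch of the endpoint-update problem being discussed. Every name in it is invented for illustration; this is not Open MPI's actual BTL API. The idea is simply that each proc caches a transport address per peer, and when a peer migrates to a new node its cached entry must be invalidated and then refreshed from the republished address before any further traffic is sent to it.]

    /*
     * Hypothetical sketch (invented names, not Open MPI internals):
     * a per-process endpoint table, with a stale/valid flag that gates
     * sends to a peer that has migrated to a new node.
     */
    #include <stdio.h>

    #define MAX_PEERS 4
    #define ADDR_LEN  32

    struct endpoint {
        int  valid;           /* 0 = stale, must re-resolve before use */
        char addr[ADDR_LEN];  /* transport address of the peer         */
    };

    static struct endpoint peers[MAX_PEERS];

    /* Mark a migrated peer's cached endpoint as stale. */
    static void invalidate_endpoint(int rank)
    {
        peers[rank].valid = 0;
    }

    /* Install the address republished by the restarted peer. */
    static void update_endpoint(int rank, const char *new_addr)
    {
        snprintf(peers[rank].addr, ADDR_LEN, "%s", new_addr);
        peers[rank].valid = 1;
    }

    /* Refuse to send until the endpoint has been refreshed. */
    static int send_to(int rank, const char *msg)
    {
        if (!peers[rank].valid) {
            fprintf(stderr, "rank %d: endpoint stale, deferring send\n",
                    rank);
            return -1;
        }
        printf("send to rank %d at %s: %s\n", rank, peers[rank].addr, msg);
        return 0;
    }

    int main(void)
    {
        update_endpoint(1, "10.0.0.2:5000"); /* initial wire-up         */
        send_to(1, "hello");                 /* works                   */

        invalidate_endpoint(1);              /* node lost: peer migrates */
        send_to(1, "hello again");           /* correctly deferred      */

        update_endpoint(1, "10.0.0.9:5000"); /* peer republishes addr   */
        send_to(1, "hello again");           /* traffic resumes         */
        return 0;
    }

[The hard parts the thread is pointing at are not captured here: detecting the failure un-gracefully (no clean shutdown), quiescing in-flight traffic, and getting the republished address to every other proc in the job.]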