On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:

> It probably should be done at a lower level, but it begs a different
> question. For example, I've created the capability in the new cluster
> manager to detect interfaces that are lost, ride through the problem by
> moving affected procs to other nodes (reconnecting ORTE-level comm), and move
> procs back if/when nodes reappear. So someone can remove a node "on-the-fly"
> and replace that hardware with another node without having to stop and
> restart the job, etc. A lot of that infrastructure is now down inside ORTE,
> though a few key pieces remain in the ORCM code base (and most likely will
> stay there).
>
> Works great - unless it is an MPI job. If we can figure out a way for the MPI
> procs to (a) be properly restarted on the "new" node, and (b) update the BTL
> connection info on the other MPI procs in the job, then we would be good to
> go...
>
> Trivial problem, I am sure :-)

...actually, the groundwork is there with Josh's work, isn't it? I think the real issue is handling un-graceful BTL failures properly. I'm guessing that's the biggest piece that isn't done...?

--
Jeff Squyres
jsquy...@cisco.com