> -----Original Message-----
> From: devel-boun...@open-mpi.org 
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
> Sent: Tuesday, December 15, 2009 6:32 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] carto vs. hwloc
> 
> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
> 
> > It probably should be done at a lower level, but it raises a 
> different question. For example, I've created the capability  
> in the new cluster manager to detect interfaces that are 
> lost, ride through the problem by moving affected procs to 
> other nodes (reconnecting ORTE-level comm), and move procs 
> back if/when nodes reappear. So someone can remove a node 
> "on-the-fly" and replace that hardware with another node 
> without having to stop and restart the job, etc. A lot of 
> that infrastructure is now down inside ORTE, though a few key 
> pieces remain in the ORCM code base (and most likely will stay there).
> > 
> > Works great - unless it is an MPI job. If we can figure out 
> a way for the MPI procs to (a) be properly restarted on the 
> "new" node, and (b) update the BTL connection info on the 
> other MPI procs in the job, then we would be good to go...
> > 
> > Trivial problem, I am sure :-)
> 
> ...actually, the groundwork is there with Josh's work, isn't 
> it?  I think the real issue is handling un-graceful BTL 
> failures properly.  I'm guessing that's the biggest piece 
> that isn't done...?

Precisely.  But why handle it at the BTL, and not at the PTL, where these
issues rightly belong, IMO?

Ken Lloyd
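
For illustration, the failover flow Ralph describes above -- detect a lost
node, restart the affected procs on a spare, push updated connection info
to every peer (the analogue of refreshing BTL endpoint tables), and move
the procs back when the node reappears -- could be sketched roughly as
follows. This is a toy model only; the class and method names are
hypothetical and do not correspond to actual ORTE/ORCM/OMPI APIs.

```python
class Job:
    """Toy model of a running job: proc placement plus per-proc
    copies of everyone's endpoint info (stand-in for BTL state)."""

    def __init__(self, placement):
        # placement: proc name -> node name
        self.placement = dict(placement)
        # each proc holds its own view of all peers' endpoints
        self.endpoints = {p: dict(placement) for p in placement}

    def _broadcast_endpoints(self):
        # step (b): propagate updated connection info to all procs
        for p in self.placement:
            self.endpoints[p] = dict(self.placement)

    def node_lost(self, node, spare):
        # step (a): restart the affected procs on the spare node
        moved = [p for p, n in self.placement.items() if n == node]
        for p in moved:
            self.placement[p] = spare
        self._broadcast_endpoints()
        return moved

    def node_returned(self, node, procs):
        # move procs back if/when the original hardware reappears
        for p in procs:
            self.placement[p] = node
        self._broadcast_endpoints()
```

The point of the sketch is that migration alone is not enough: after every
move, all peers' endpoint views must be refreshed, which is exactly the
"update the BTL connection info on the other MPI procs" piece that was
identified as missing.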


> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
