> -----Original Message-----
> From: devel-boun...@open-mpi.org
> [mailto:devel-boun...@open-mpi.org] On Behalf Of Jeff Squyres
> Sent: Tuesday, December 15, 2009 6:32 PM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] carto vs. hwloc
>
> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>
> > It probably should be done at a lower level, but it raises a
> different question. For example, I've created the capability
> in the new cluster manager to detect interfaces that are
> lost, ride through the problem by moving affected procs to
> other nodes (reconnecting ORTE-level comm), and move procs
> back if/when nodes reappear. So someone can remove a node
> "on-the-fly" and replace that hardware with another node
> without having to stop and restart the job, etc. A lot of
> that infrastructure is now down inside ORTE, though a few key
> pieces remain in the ORCM code base (and most likely will stay there).
> >
> > Works great - unless it is an MPI job. If we can figure out
> a way for the MPI procs to (a) be properly restarted on the
> "new" node, and (b) update the BTL connection info on the
> other MPI procs in the job, then we would be good to go...
> >
> > Trivial problem, I am sure :-)
>
> ...actually, the groundwork is there with Josh's work, isn't
> it? I think the real issue is handling un-graceful BTL
> failures properly. I'm guessing that's the biggest piece
> that isn't done...?
Precisely. But why handle this in the BTL, and not in the PTL? (That is where
these issues rightly belong, IMO.)
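
As a rough illustration of Ralph's point (b) above -- that every surviving proc's
transport-layer view of a restarted peer must be refreshed before traffic can
resume -- here is a minimal sketch. This is NOT Open MPI code; the names
(`EndpointTable`, `update_endpoint`, `mark_failed`) are hypothetical, and it
only models the bookkeeping a modex-style refresh of BTL contact info would
have to perform:

```python
# Illustrative model only, not Open MPI internals. Each process keeps a
# per-rank table of peer endpoints; a failure marks an entry dead, and a
# restart on a new node replaces it with fresh contact info.
from dataclasses import dataclass


@dataclass
class Endpoint:
    host: str
    port: int
    alive: bool = True


class EndpointTable:
    """One process's view of where every peer rank can be reached."""

    def __init__(self) -> None:
        self._by_rank: dict[int, Endpoint] = {}

    def register(self, rank: int, host: str, port: int) -> None:
        self._by_rank[rank] = Endpoint(host, port)

    def mark_failed(self, rank: int) -> None:
        # Analogous to a BTL noticing a dead connection: stop routing to it.
        self._by_rank[rank].alive = False

    def update_endpoint(self, rank: int, host: str, port: int) -> None:
        # Analogous to propagating the restarted proc's new contact info
        # to the surviving procs so their BTLs can reconnect.
        self._by_rank[rank] = Endpoint(host, port)

    def lookup(self, rank: int) -> Endpoint:
        ep = self._by_rank[rank]
        if not ep.alive:
            raise ConnectionError(f"rank {rank} unreachable; awaiting update")
        return ep
```

The hard part the thread is discussing is, of course, not this table but
delivering the `update_endpoint` event reliably to every proc while in-flight
traffic to the dead peer fails un-gracefully.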
Ken Lloyd
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel