Currently, I am working on process migration and automatic recovery based on checkpoint/restart. With respect to the PML stack, this works by rewiring the BTLs after the migrated/recovered MPI process(es) restart. There is a fair amount of work in getting this right in both the runtime and the OMPI layer (particularly the modex). For automatic recovery with C/R we will, at first, require restarting all processes in the job [for consistency]. For migration, only the processes being moved need to be restarted; all others may be blocked.
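
To make the rewiring sequence concrete, here is a minimal, self-contained sketch in plain C. All names (peer_t, modex_board, rewire) are hypothetical stand-ins, not Open MPI internals: after the restart every process republishes its contact info, and every process rebuilds its peer endpoints from the freshly exchanged data, the moral equivalent of re-running the modex and add_procs.

    /* Toy model of the coordinated C/R rewiring sequence described above.
     * All names are hypothetical; this is not Open MPI code, just an
     * illustration of the publish-then-rewire pattern. */
    #include <stdio.h>

    #define NPROCS 4

    /* Per-process view of one peer's transport endpoint. */
    typedef struct {
        int  rank;
        char addr[32];   /* stand-in for a BTL address (e.g. host:port) */
        int  connected;
    } peer_t;

    /* Stand-in for the modex: a board where each rank publishes its
     * contact info and from which every rank reads everyone else's. */
    static char modex_board[NPROCS][32];

    static void modex_publish(int rank, const char *addr) {
        snprintf(modex_board[rank], sizeof(modex_board[rank]), "%s", addr);
    }

    /* Rewire: drop stale endpoints and rebuild them from the board. */
    static void rewire(peer_t peers[], int self) {
        for (int r = 0; r < NPROCS; r++) {
            if (r == self) continue;
            peers[r].rank = r;
            snprintf(peers[r].addr, sizeof(peers[r].addr), "%s",
                     modex_board[r]);
            peers[r].connected = 1;  /* real stack: lazy reconnect on send */
        }
    }

    int main(void) {
        peer_t peers[NPROCS];
        char addr[32];
        int self = 0;

        /* Initial wiring: every rank publishes, then wires up. */
        for (int r = 0; r < NPROCS; r++) {
            snprintf(addr, sizeof(addr), "nodeA:%d", 5000 + r);
            modex_publish(r, addr);
        }
        rewire(peers, self);
        printf("before restart: peer 1 at %s\n", peers[1].addr);

        /* Checkpoint/restart: restarted procs come back with new
         * addresses, so everyone republishes and everyone rewires. */
        for (int r = 0; r < NPROCS; r++) {
            snprintf(addr, sizeof(addr), "nodeB:%d", 6000 + r);
            modex_publish(r, addr);
        }
        rewire(peers, self);
        printf("after restart:  peer 1 at %s\n", peers[1].addr);
        return 0;
    }

In the real stack the equivalent of rewire() would also have to tear down the old connections; the sketch just overwrites the table.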
I think what you are looking for is the ability to lose a process and replace it without restarting all the rest of the processes. This would require a bit more work beyond what I am currently working on, since you would need to flush the PML/BML/BTL stack of latent messages, etc. The message logging work by UTK should do this anyway (if they use uncoordinated C/R + message logging), but they will have to fill in the details on that project.

-- Josh

On Dec 16, 2009, at 1:32 AM, George Bosilca wrote:

> As far as I know, what Josh did is slightly different. In the case of a complete restart (where all processes are restarted from a checkpoint), he sets up and rewires a new set of BTLs.
>
> However, it happens that we do have some code to rewire the MPI processes in case of failure(s) in one of the UTK projects. I'll have to talk with the team here to see if at this point there is something we can contribute regarding this matter.
>
>   george.
>
> On Dec 15, 2009, at 21:08, Ralph Castain wrote:
>
>> On Dec 15, 2009, at 6:31 PM, Jeff Squyres wrote:
>>
>>> On Dec 15, 2009, at 2:20 PM, Ralph Castain wrote:
>>>
>>>> It probably should be done at a lower level, but it raises a different question. For example, I've created the capability in the new cluster manager to detect interfaces that are lost, ride through the problem by moving affected procs to other nodes (reconnecting ORTE-level comm), and move procs back if/when nodes reappear. So someone can remove a node "on-the-fly" and replace that hardware with another node without having to stop and restart the job, etc. A lot of that infrastructure is now down inside ORTE, though a few key pieces remain in the ORCM code base (and most likely will stay there).
>>>>
>>>> Works great - unless it is an MPI job. If we can figure out a way for the MPI procs to (a) be properly restarted on the "new" node, and (b) update the BTL connection info on the other MPI procs in the job, then we would be good to go...
>>>>
>>>> Trivial problem, I am sure :-)
>>>
>>> ...actually, the groundwork is there with Josh's work, isn't it? I think the real issue is handling un-graceful BTL failures properly. I'm guessing that's the biggest piece that isn't done...?
>>
>> Think so... not sure how to update the BTLs with the new info, but perhaps Josh has already solved that problem.
>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
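
For the single-process replacement case discussed above, here is a minimal sketch of the flush-then-rewire-one-peer idea. Everything here (endpoint_t, flush_to_log, the replay loop) is hypothetical illustration, not Open MPI code, and the message-log replay is the piece that uncoordinated C/R + message logging would have to supply.

    /* Toy illustration of replacing one failed process without a full
     * restart: flush latent traffic to the failed peer into a log,
     * rewire just that peer's endpoint, then replay the log.
     * All names are hypothetical; this is not Open MPI code. */
    #include <stdio.h>
    #include <string.h>

    #define QLEN 8

    typedef struct {
        char addr[32];           /* stand-in for the peer's BTL address */
        int  alive;
        char pending[QLEN][16];  /* latent messages queued for this peer */
        int  npending;
    } endpoint_t;

    /* Flush: pull in-flight traffic off a dead peer's queue, keeping a
     * copy in a message log so it can be replayed after rewiring. */
    static int flush_to_log(endpoint_t *ep, char log[][16]) {
        int n = ep->npending;
        for (int i = 0; i < n; i++)
            strcpy(log[i], ep->pending[i]);
        ep->npending = 0;
        return n;
    }

    int main(void) {
        endpoint_t peer = { "nodeA:5001", 1, { "msg-0", "msg-1" }, 2 };
        char log[QLEN][16];

        /* Peer fails: stop using the stale endpoint, flush latent
         * messages out of the stack. */
        peer.alive = 0;
        int logged = flush_to_log(&peer, log);
        printf("flushed %d latent message(s) to the log\n", logged);

        /* Replacement process comes up elsewhere and publishes a new
         * address; only this one endpoint is rewired, nothing else
         * restarts. */
        strcpy(peer.addr, "nodeB:6001");
        peer.alive = 1;

        /* Replay the logged messages against the new endpoint. */
        for (int i = 0; i < logged; i++)
            printf("replaying %s to %s\n", log[i], peer.addr);
        return 0;
    }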