I can point you to some of the work that I did while at Indiana University to support coordinated process migration in Open MPI. It should introduce you to some of the internal pieces that fit together to provide this support.
The transparent C/R in Open MPI webpage from IU is a good place to start:
  http://osl.iu.edu/research/ft/ompi-cr/index.php

From there you will find links to a couple of papers that should get you started. In particular, "A Composable Runtime Recovery Policy Framework Supporting Resilient HPC Applications" discusses how the ORTE ErrMgr framework was used (initially) to provide process migration and automatic recovery. The actual code in the Open MPI trunk is slightly different. Instead of using different components of the ErrMgr framework (i.e., autor, crmig, stable), we just rolled it all into the existing components (i.e., hnp, orted, app). All of the code can be found in those component directories.

If you want a more general overview of the C/R system in Open MPI, I would start with the paper "The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI", which provides a high-level view of the architecture (combined with the paper above, you will have a fairly complete picture of the design).

The C/R infrastructure currently supports only coordinated C/R, but it was designed to be more extensible. So if you are looking into uncoordinated C/R techniques, you may find that many of the C/R frameworks in Open MPI can be reused.

That should get you started. Let us know if you have any further questions.

-- Josh

On Jan 6, 2011, at 3:19 PM, Hugo Meyer wrote:

> Thanks for the reply and don't worry about the delay.
>
> Yeah, I suppose it wouldn't be easy :(.
> But my final goal is what you are mentioning: to stop one particular
> process (previously checkpointed), then migrate it to another place (node,
> core, slot, etc.) and restart it there, but without making a coordinated
> checkpoint. I just need to checkpoint processes in an uncoordinated way and
> move them.
>
> Where can I see something about process migration in the code? Or something
> that could guide me?
>
> Greetings.
>
> Hugo Meyer
>
> 2011/1/6 Jeff Squyres <jsquy...@cisco.com>
> Sorry for the delay; you wrote while many of us were on vacation and we're
> just now starting to catch up on past mails...
>
> I'm not entirely sure what you're trying to do. It sounds like you're trying
> to replace one process with another. That's quite complicated; there will be
> a lot of changes required in the code base to do this.
>
> - you'll need to notify the ORTE subsystem of the process change
> - this notification will likely need to span multiple processes
> - all MPI processes will need to quiesce their communications, disconnect,
>   and reconnect
> - ...and probably other things
>
> That being said, you might be able to leverage some of the work that's been
> done with checkpoint/restart/migration. It's not entirely the same thing
> that you're doing, but it's at least similar (quiesce networks, [pretend to]
> move a process from location A to location B, etc.).
>
>
>
> On Dec 28, 2010, at 7:03 AM, Hugo Meyer wrote:
>
> > Hello to all.
> >
> > I'm new to the forum; at least, this is the first time I write.
> >
> > I'm working with Open MPI and I would like to do a little experiment: I
> > will try to replace one process with another process.
> >
> > For example, assume that there are 2 processes that are communicating, say
> > ranks 1 and 2, and there is a process of rank 3. I would like rank 3 (it
> > could be assumed that this node is marked down in the initial hostfile)
> > to take the place of rank 2, with rank 1 still thinking that it is
> > communicating with rank 2 when in fact it is communicating with rank 3.
> >
> > I guess I'll have to modify tables such as orte_job_map_t and orte_proc_t,
> > but I wanted to know if someone already has experience doing something
> > similar and can guide me at least.
> >
> > The communication between processes, in principle, would be irrelevant, so
> > I will not need to use checkpoints / restarts for now.
> >
> > Greetings
> >
> > Hugo Meyer
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey