I can point you to some of the work I did at Indiana University to support 
coordinated process migration in Open MPI. It should introduce you to the 
internal pieces that fit together to provide this support.

The transparent C/R in Open MPI webpage from IU is a good place to start:
   http://osl.iu.edu/research/ft/ompi-cr/index.php

From there you will find links to a couple of papers that should get you 
started. In particular, "A Composable Runtime Recovery Policy Framework 
Supporting Resilient HPC Applications" discusses how the ORTE ErrMgr framework 
was (initially) used to provide process migration and automatic recovery. The 
actual code in the Open MPI trunk is slightly different: instead of using 
separate components of the ErrMgr framework (i.e., autor, crmig, stable), we 
rolled it all into the existing components (i.e., hnp, orted, app). All of 
the code can be found in those component directories.

If you want a more general overview of the C/R system in Open MPI, I would 
start with the paper "The Design and Implementation of Checkpoint/Restart 
Process Fault Tolerance for Open MPI", which provides a high-level view of the 
architecture (combined with the paper above, you will have a fairly complete 
picture of the design). The C/R infrastructure currently supports only 
coordinated C/R, but it was designed to be extensible. So if you are looking 
into uncoordinated C/R techniques, you may find that many of the C/R frameworks 
in Open MPI can be reused.
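
To make the "coordinated" part concrete, here is a toy sketch in Python of the 
general idea: pause every rank, drain in-flight messages, snapshot all ranks 
together, then restart one rank elsewhere under the same rank number. This is 
only an illustration under my own assumptions. ToyProc, coordinated_checkpoint, 
migrate, and resume are hypothetical names, not Open MPI or ORTE interfaces, 
and the real code paths are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProc:
    """Hypothetical stand-in for an MPI process; not an Open MPI structure."""
    rank: int
    node: str
    state: int = 0
    inbox: list = field(default_factory=list)  # messages sent but not yet consumed
    paused: bool = False


def coordinated_checkpoint(procs):
    """Pause all ranks, drain in-flight messages, then snapshot every rank."""
    for p in procs.values():
        p.paused = True
    for p in procs.values():
        p.state += sum(p.inbox)  # consume pending messages before the snapshot
        p.inbox.clear()
    return {rank: p.state for rank, p in procs.items()}


def migrate(procs, snapshot, rank, new_node):
    """Restart one rank from the snapshot on another node, keeping its rank."""
    procs[rank] = ToyProc(rank=rank, node=new_node,
                          state=snapshot[rank], paused=True)


def resume(procs):
    """Let every rank continue once the moved rank is back in place."""
    for p in procs.values():
        p.paused = False


procs = {r: ToyProc(rank=r, node=f"node{r}") for r in range(3)}
procs[2].inbox.extend([5, 7])         # two messages in flight to rank 2
snap = coordinated_checkpoint(procs)
migrate(procs, snap, rank=2, new_node="spare0")
resume(procs)
print(procs[2].node, procs[2].state)  # spare0 12
```

Because peers address the moved process only by rank, they do not need to know 
it now lives on a different node. An uncoordinated scheme would drop the global 
pause and would then have to deal with messages that cross the checkpoint, 
which this sketch deliberately avoids.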

That should get you started. Let us know if you have any further questions.

-- Josh

On Jan 6, 2011, at 3:19 PM, Hugo Meyer wrote:

> Thanks for the reply and don't worry about the delay.
> 
> Yeah, I suppose it won't be easy :(.
> But my final goal is what you mention: to stop one particular process 
> (previously checkpointed), then migrate it to another place (node, core, 
> slot, etc.) and restart it there, but without making a coordinated 
> checkpoint. I just need to checkpoint processes in an uncoordinated way and 
> move them.
> 
> Where can I see something about process migration in the code, or something 
> that could guide me?
> 
> Greetings.
> 
> Hugo Meyer
> 
> 2011/1/6 Jeff Squyres <jsquy...@cisco.com>
> Sorry for the delay; you wrote while many of us were on vacation and we're 
> just now starting to catch up on past mails...
> 
> I'm not entirely sure what you're trying to do.  It sounds like you're trying 
> to replace one process with another.  That's quite complicated; there will be 
> a lot of changes required in the code base to do this.
> 
> - you'll need to notify the ORTE subsystem of the process change
> - this notification will likely need to span multiple processes
> - all MPI processes will need to quiesce their communications, disconnect, 
> and reconnect
> - ...and probably other things
> 
> That being said, you might be able to leverage some of the work that's been 
> done with checkpoint/restart/migration.  It's not entirely the same thing 
> that you're doing, but it's at least similar (quiesce networks, [pretend to] 
> move a process from location A to location B, etc.).
> 
> 
> 
> On Dec 28, 2010, at 7:03 AM, Hugo Meyer wrote:
> 
> > Hello to all.
> >
> > I'm new to the forum; at least, this is the first time I have written.
> >
> > I'm working with Open MPI and would like to do a little experiment: I 
> > will try to substitute one process for another.
> >
> > For example, assume that two processes, say rank 1 and rank 2, are 
> > communicating, and that there is also a process of rank 3 (assume its node 
> > is marked as down in the initial hostfile). I would like rank 3 to take 
> > the place of rank 2, so that rank 1 still thinks it is communicating with 
> > rank 2 when in fact it is communicating with rank 3.
> >
> > I guess I'll have to modify tables such as orte_job_map_t and orte_proc_t, 
> > but I wanted to know whether someone already has experience doing something 
> > similar and could at least guide me.
> >
> > The communication between the processes would, in principle, be 
> > irrelevant, so I will not need to use checkpoints/restarts for now.
> >
> > Greetings
> >
> > Hugo Meyer
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey

