So the HNP/mpirun knows when the job is fully restarted. The code for that is at:
  orte/mca/snapc/full/snapc_full_global.c:1758

This should prevent ompi-checkpoint from starting a checkpoint before the restart is complete. I suspect those are the errors that you are talking about.

Since you are triggering the checkpoint external to the application, you will need to add code to the HNP/mpirun around the code cited above to notify you of the restart completion. There is no such mechanism for an external tool to know that the job has successfully finished the restart in the current trunk.

If you come up with a good solution, send us a patch and we can try to work it into the trunk.

-- Josh

On Wed, Jun 15, 2011 at 5:36 PM, Kishor Kharbas <kkha...@ncsu.edu> wrote:
> Hello !
> I am working on some simulations where I have to perform periodic
> kill-restart and checkpointing on a MPI application.
> As a checkpoint can take place immediately after restart I need some way to
> know whether ompi-restart of the application is complete.
> If I do not ensure that restart of all application processes is complete,
> ompi-checkpoint fails after throwing a slew of errors.
> Can someone please suggest an idea for having some kind of notification
> indicating restarts have complete (in the sense that checkpointing .
> Thank you,
> Kishor

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
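
For illustration only, here is a minimal sketch of one shape such a notification could take: the HNP writes a marker file once it has seen the restart complete, and the external tool polls for that file before calling ompi-checkpoint. Nothing below is an existing Open MPI interface. The function names, the marker-file approach, and the path are all assumptions, and the call into notify_restart_complete() would have to be added by hand near the snapc code cited above.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* HYPOTHETICAL helper, not part of Open MPI. The idea is that the
 * HNP/mpirun would call this once it knows every process has restarted.
 * The marker is written to a temporary name and then renamed, so a
 * poller never observes a half-written file. Returns 0 on success. */
int notify_restart_complete(const char *marker_path)
{
    char tmp_path[4096];
    int n = snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", marker_path);
    if (n < 0 || (size_t)n >= sizeof(tmp_path)) {
        return -1;
    }

    FILE *fp = fopen(tmp_path, "w");
    if (fp == NULL) {
        return -1;
    }
    int ok = fprintf(fp, "restart-complete pid=%ld\n", (long)getpid()) > 0;
    if (fclose(fp) != 0) {
        ok = 0;
    }
    if (!ok || rename(tmp_path, marker_path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}

/* HYPOTHETICAL external-tool side: poll until the marker appears.
 * Returns 0 if the marker exists, 1 on timeout, -1 on any other error. */
int wait_for_restart_complete(const char *marker_path, int timeout_sec)
{
    struct timespec pause = { 0, 100 * 1000 * 1000 }; /* 100 ms */
    time_t deadline = time(NULL) + timeout_sec;
    struct stat sb;

    for (;;) {
        if (stat(marker_path, &sb) == 0) {
            return 0;
        }
        if (errno != ENOENT) {
            return -1;
        }
        if (time(NULL) >= deadline) {
            return 1;
        }
        nanosleep(&pause, NULL);
    }
}
```

An external driver would remove any stale marker before running ompi-restart, then call wait_for_restart_complete() and only invoke ompi-checkpoint once it returns 0. A file is used here only because it needs no extra communication channel; a proper patch might prefer to report completion through the existing tool interface instead.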