The HNP (mpirun) already knows when the job has fully restarted. The
code that detects this is at:
  orte/mca/snapc/full/snapc_full_global.c:1758

This should prevent ompi-checkpoint from starting a checkpoint before
the restart is complete. I suspect those are the errors that you are
talking about.

The current trunk has no mechanism for an external tool to learn that
the job has successfully finished restarting. Since you are triggering
the checkpoint from outside the application, you will need to add code
to the HNP/mpirun, around the location cited above, that notifies your
tool when the restart completes. If you come up with a good solution,
send us a patch and we can try to work it into the trunk.
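
As a rough sketch of the external side only: assume you patch mpirun,
near the code cited above, to create a marker file once the restart has
finished. That marker file, its path, and the function names below are
all hypothetical; nothing like this exists in the trunk. A Python driver
could then wait for the file before calling ompi-checkpoint with the PID
of mpirun:

```python
import os
import subprocess
import time


def wait_for_marker(path, timeout=60.0, poll_interval=0.5):
    """Poll until `path` exists. Return True if seen before the timeout, else False."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def checkpoint_when_restarted(marker_path, mpirun_pid, timeout=60.0):
    """Run ompi-checkpoint only after the (hypothetical) restart marker appears."""
    if not wait_for_marker(marker_path, timeout=timeout):
        raise TimeoutError("restart marker not seen: " + marker_path)
    # Consume the marker so that the next restart has to create it again.
    os.remove(marker_path)
    return subprocess.run(["ompi-checkpoint", str(mpirun_pid)], check=True)
```

Polling a file keeps the tool independent of ORTE internals. The cost is
that the HNP patch has to create the marker only after every process has
reported in, otherwise the same race comes back.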

-- Josh

On Wed, Jun 15, 2011 at 5:36 PM, Kishor Kharbas <kkha...@ncsu.edu> wrote:
> Hello !
> I am working on some simulations where I have to perform periodic
> kill-restart and checkpointing on an MPI application.
> As a checkpoint can take place immediately after restart I need some way to
> know whether ompi-restart of the application is complete.
> If I do not ensure that restart of all application processes is complete,
> ompi-checkpoint fails after throwing a slew of errors.
> Can someone please suggest an idea for having some kind of notification
> indicating restarts have completed (in the sense that checkpointing .
> Thank you,
> Kishor
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
