Hi,
I have a question about checkpoint-restart operation with Open MPI. I
hope this is an appropriate forum for my question.

I do not have access to recompile the kernel or load kernel modules,
so I would like to use the Condor checkpoint-restart library. Can
that be made to work with Open MPI's checkpoint-restart
infrastructure?

The Condor library, upon receipt of a signal or when its checkpoint
function is called from within the program, generates a file containing
the complete (as complete as possible) state of the process, including
the state of libraries, e.g. Open MPI. On restart, the process
image/state is loaded into memory and execution is resumed at the
checkpoint location.

On restart, I assume that some information in the MPI state may need
to be reinitialized, since e.g. the host names of the MPI processes
and the PIDs of possible support processes will have changed.

Is this tricky to fix? Presumably code that handles this must already
exist for the BLCR compatibility.

Perhaps it can be achieved by (in violation of the MPI standard)
calling MPI_Finalize before the checkpoint, and MPI_Init after
restart? This seems like a conceptually appealing solution, but may
not be allowed, nor do the correct thing, in Open MPI?!
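
To make the idea concrete, here is a rough sketch in C of what I have
in mind. It is only an illustration under assumptions: I assume the
Condor library exposes a checkpoint call named ckpt(), the wrapper
name checkpoint_without_mpi is made up, and by default the MPI and
Condor calls are replaced by small stand-ins so the sketch compiles
without either library. Whether Open MPI tolerates MPI_Init after
MPI_Finalize is exactly what I do not know.

```c
#include <assert.h>
#include <stdio.h>

#ifdef USE_REAL_MPI
#include <mpi.h>
extern void ckpt(void);  /* ASSUMPTION: Condor checkpoint entry point */
#else
/* Stand-ins (hypothetical) so the sketch builds without MPI or Condor. */
#define MPI_SUCCESS 0
static int mpi_up = 0;   /* tracks whether "MPI" is initialized */

static int MPI_Init(int *argc, char ***argv)
{
    (void)argc; (void)argv;
    mpi_up = 1;
    return MPI_SUCCESS;
}

static int MPI_Finalize(void)
{
    mpi_up = 0;
    return MPI_SUCCESS;
}

static void ckpt(void)
{
    /* The whole point: no live MPI state inside the saved image. */
    assert(!mpi_up);
    printf("checkpoint taken with MPI shut down\n");
}
#endif

/* Shut MPI down, checkpoint, then bring MPI back up.
 * After a restart, execution would resume just after ckpt(), so the
 * MPI_Init below would run on the new hosts with the new PIDs.
 * Returns 0 on success, -1 if either MPI call fails. */
static int checkpoint_without_mpi(int *argc, char ***argv)
{
    if (MPI_Finalize() != MPI_SUCCESS)
        return -1;
    ckpt();
    if (MPI_Init(argc, argv) != MPI_SUCCESS)
        return -1;
    return 0;
}
```

The application would call checkpoint_without_mpi(&argc, &argv) at a
point where no communication is pending. Note that the MPI standard
does not permit MPI_Init after MPI_Finalize, so with a real MPI
library this may simply fail.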

  Thanks for any ideas/help/pointers to more information!

         Tomas
