Hi, I have a question about checkpoint-restart operation with Open MPI. I hope this is an appropriate forum for my question.
I am not able to recompile the kernel or load kernel modules, so I would like to use the Condor checkpoint-restart library. Can that be made to work with Open MPI's checkpoint-restart infrastructure?

The Condor library, upon receipt of a signal or when its checkpoint function is called from within the program, generates a file containing the complete (as complete as possible) state of the process, including the state of libraries, e.g. Open MPI. On restart, the process image/state is loaded into memory and execution is resumed at the checkpoint location.

I assume that some information in the MPI state may need to be reinitialized on restart, since e.g. the names of the hosts running the MPI processes and the PIDs of possible support processes will have changed. Is this tricky to fix? (Presumably such code must already be there for the BLCR compatibility.) Perhaps it can be achieved by (in violation of the MPI standard) calling MPI_Finalize before the checkpoint and MPI_Init after restart? This seems like a conceptually appealing solution, but it may not be allowed, or may not do the correct thing, in Open MPI.

Thanks for any ideas/help/pointers to more information!

Tomas