Dear All,

I am trying to checkpoint an MPI application that has two processes, each running on a separate host.
I run the application as follows:

raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m

and I trigger the checkpoint as follows:

raj@sun32:~$ ompi-checkpoint -v 30010

The following happens, with two errors displayed while checkpointing the application:

##############################################
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06
I am processorrrrrrrr no 0 of a total of 2 procs on host sun32
I am processorrrrrrrr no 1 of a total of 2 procs on host sun06
[sun32:30010] Error: expected_component: PID information unavailable!
[sun32:30010] Error: expected_component: Component Name information unavailable!
I am processssssssssssor no 1 of a total of 2 procs on host sun06
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye
bye
############################################

When I try to restart the application from the checkpointed file, I get the following:

raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename or provided an invalid filename.
Please see --help for usage.
--------------------------------------------------------------------------
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye

I would very much appreciate it if you could give me some ideas on how to checkpoint and restart an MPI application running on multiple hosts.

Thank you.

Regards,
Raj