Dear All,
I am trying to checkpoint an MPI application that has two processes,
each running on a separate host.

I run the application as follows:

raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m.

and I trigger the checkpoint as follows:

raj@sun32:~$ ompi-checkpoint -v 30010


The following happens, with two errors displayed while checkpointing the application:


##############################################
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06
I am processorrrrrrrr no 0 of a total of 2 procs on host sun32 
I am processorrrrrrrr no 1 of a total of 2 procs on host sun06 

[sun32:30010] Error: expected_component: PID information unavailable!
[sun32:30010] Error: expected_component: Component Name information unavailable!

I am processssssssssssor no 1 of a total of 2 procs on host sun06
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye 
bye 
############################################




When I try to restart the application from the checkpoint, I get the
following:

raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
--------------------------------------------------------------------------
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.

--------------------------------------------------------------------------
I am processssssssssssor no 0 of a total of 2 procs on host sun32
bye 
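
One check that might help narrow this down is whether the local snapshots of both ranks are actually visible on sun32 before restarting. The sketch below is only an assumption-laden helper, not part of Open MPI: `check_snapshots` is a hypothetical name, and it assumes each rank's local snapshot is named opal_snapshot_<rank>.ckpt (as in the error above) and sits somewhere under the global snapshot directory.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI tool): report which per-rank
# snapshots are missing under a global snapshot directory on this host.
# Assumption: local snapshots are named opal_snapshot_<rank>.ckpt.
check_snapshots() {
    global_dir=$1   # global snapshot directory to search
    expected=$2     # number of ranks that were checkpointed
    rank=0
    missing=0
    while [ "$rank" -lt "$expected" ]; do
        # Search the whole tree so the exact sub-directory layout does not matter.
        found=$(find "$global_dir" -name "opal_snapshot_${rank}.ckpt" 2>/dev/null | head -n 1)
        if [ -z "$found" ]; then
            echo "missing: opal_snapshot_${rank}.ckpt"
            missing=$((missing + 1))
        fi
        rank=$((rank + 1))
    done
    # Succeed only if every expected snapshot was found.
    [ "$missing" -eq 0 ]
}
```

For the run above this would be called on sun32 as `check_snapshots /tmp/ompi_global_snapshot_30010.ckpt 2`. If it reports opal_snapshot_1.ckpt as missing, one possible explanation (an assumption, not confirmed here) is that /tmp is local to each host, so the snapshot of the process on sun06 never reached sun32.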


I would very much appreciate it if you could give me some ideas on how to
checkpoint and restart an MPI application running on multiple hosts.

Thank you

Regards,

Raj


