Dear all,

I am trying to checkpoint an MPI application at specific points in my
program. So I created a small function as follows:

void mychkpt()
{
    system("ompi-checkpoint -v `pidof mpirun`");
}

and I am calling it in my MPI application at specific points, e.g.:

printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 6");
printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 4");

If I do:

mpirun -am ft-enable-cr -np 1 mpisleepts0

it works fine, but if I use more than one process there is a problem, e.g.:

mpirun -am ft-enable-cr -np 2 mpisleepts0

I get

I am processor no 0 of a total of 2 procs 
I am processor no 1 of a total of 2 procs 
[jean:13673] orte_checkpoint: Checkpointing...
[jean:13673]      PID 13647
[jean:13673]      Connected to Mpirun [[28355,0],0]
[jean:13673] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13673] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid 
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]                 Requested - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]                   Pending - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]                   Running - Global Snapshot Reference: (null)
[jean:13672] orte_checkpoint: Checkpointing...
[jean:13672]      PID 13647
[jean:13672]      Connected to Mpirun [[28355,0],0]
[jean:13672] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13672] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid 
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]             File Transfer - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673]                  Finished - Global Snapshot Reference: ompi_global_snapshot_13647.ckpt
Snapshot Ref.:   0 
^Xmpirun: killing job...

Every process runs the function, so the checkpoint command is issued twice
at the same time, which causes problems.

How can I ensure that the checkpoint is requested only once when more than
one process is running?
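
To make the question concrete, here is a minimal sketch of the kind of guard
I have in mind. It assumes that letting only rank 0 issue the command is
enough. mychkpt_once is a hypothetical name, rank is the value obtained from
MPI_Comm_rank, and I have not verified whether the other ranks also need to
synchronize (for example with MPI_Barrier) around the call.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical variant of mychkpt(): every process calls it, but only
 * rank 0 actually runs ompi-checkpoint. 'rank' is assumed to be the
 * value returned by MPI_Comm_rank(MPI_COMM_WORLD, &rank).
 *
 * Returns 0 on ranks that skip the request, otherwise the status
 * returned by system().
 */
int mychkpt_once(int rank)
{
    assert(rank >= 0);  /* MPI ranks are never negative */

    if (rank != 0)
        return 0;       /* all other ranks skip the request */

    return system("ompi-checkpoint -v `pidof mpirun`");
}
```

Each process would then call mychkpt_once(rank) at the checkpoint points
instead of mychkpt().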

Please give me some ideas.

Thank you


