Dear all,

I am trying to checkpoint an MPI application at specific points in my program, so I created a small function as follows:

    void mychkpt()
    {
        system("ompi-checkpoint -v `pidof mpirun`");
    }

and I call it at specific points in my MPI application, e.g.

##############
printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 6");
mychkpt();
printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 4");
mychkpt();
#############

If I run

    mpirun -am ft-enable-cr -np 1 mpisleepts0

it works fine, but with more than one process there is a problem. For example, with

    mpirun -am ft-enable-cr -np 2 mpisleepts0

I get:

################
I am processor no 0 of a total of 2 procs
I am processor no 1 of a total of 2 procs
[jean:13673] orte_checkpoint: Checkpointing...
[jean:13673] PID 13647
[jean:13673] Connected to Mpirun [[28355,0],0]
[jean:13673] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13673] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Requested - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Pending - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Running - Global Snapshot Reference: (null)
[jean:13672] orte_checkpoint: Checkpointing...
[jean:13672] PID 13647
[jean:13672] Connected to Mpirun [[28355,0],0]
[jean:13672] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13672] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] File Transfer - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Finished - Global Snapshot Reference: ompi_global_snapshot_13647.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_13647.ckpt
^Xmpirun: killing job...
#################

The function runs in both processes at the same time, so the checkpoint is requested twice simultaneously, which causes problems.

How can I ensure that the checkpoint is requested only once when more than one process is running? Please give me some ideas.

Thank you,
Jean
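
One idea I could try, as an untested sketch only: let a single rank issue the checkpoint request and have every other rank skip it. The names should_checkpoint and mychkpt_once are hypothetical helpers made up for illustration, and picking rank 0 is an assumption; the rank argument is the value the program already gets from MPI_Comm_rank. Whether requesting a checkpoint from inside the running job behaves well with ft-enable-cr is a separate question this sketch does not settle.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 only for the one rank chosen to
 * trigger the checkpoint (rank 0 is an assumption), 0 otherwise. */
int should_checkpoint(int rank)
{
    return rank == 0;
}

/* Variant of mychkpt() that takes the caller's rank, as obtained from
 * MPI_Comm_rank. Only the chosen rank runs ompi-checkpoint; all other
 * ranks return 0 immediately without running anything. On the chosen
 * rank the return value is whatever system() reports. */
int mychkpt_once(int rank)
{
    assert(rank >= 0); /* MPI ranks are never negative */
    if (!should_checkpoint(rank)) {
        return 0;
    }
    return system("ompi-checkpoint -v `pidof mpirun`");
}
```

In the program above, each mychkpt(); call would become mychkpt_once(rank);. It may also be worth placing an MPI_Barrier(MPI_COMM_WORLD) before the call so that all ranks have reached the same point when the request is made, but that is a guess on my part.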