Dear all,
I am trying to checkpoint an MPI application at specific points
in my program. So I created a small function as follows:
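#include <stdlib.h>  /* system() */

/* Ask mpirun to checkpoint the whole job via the ompi-checkpoint tool. */
void mychkpt(void)
{
    system("ompi-checkpoint -v `pidof mpirun`");
}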
and I call it from my MPI application at specific points, e.g.:
##############
printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 6");
mychkpt();
printf("I am processor no %d of a total of %d procs \n", rank, size);
system("sleep 4");
mychkpt();
#############
If I run:
mpirun -am ft-enable-cr -np 1 mpisleepts0
it works fine. But if I use more than one process there is a problem, e.g. with:
mpirun -am ft-enable-cr -np 2 mpisleepts0
I get:
################
I am processor no 0 of a total of 2 procs
I am processor no 1 of a total of 2 procs
[jean:13673] orte_checkpoint: Checkpointing...
[jean:13673] PID 13647
[jean:13673] Connected to Mpirun [[28355,0],0]
[jean:13673] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13673] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Requested - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Pending - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Running - Global Snapshot Reference: (null)
[jean:13672] orte_checkpoint: Checkpointing...
[jean:13672] PID 13647
[jean:13672] Connected to Mpirun [[28355,0],0]
[jean:13672] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13647
[jean:13672] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] File Transfer - Global Snapshot Reference: (null)
[jean:13673] orte_checkpoint: hnp_receiver: Receive a command message.
[jean:13673] orte_checkpoint: hnp_receiver: Status Update.
[jean:13673] Finished - Global Snapshot Reference: ompi_global_snapshot_13647.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_13647.ckpt
^Xmpirun: killing job...
#################
Both processes run the function at the same time, so ompi-checkpoint is
invoked twice concurrently, which causes problems.
How can I ensure that the checkpoint is requested only once when more than
one process is running?
Please give me some ideas.
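One possible direction, as an untested sketch rather than a confirmed fix: let only one rank issue the request, since a single ompi-checkpoint call against mpirun already covers the whole job. The helper name request_checkpoint_once and its return convention are hypothetical and only for illustration. The rank is passed in as a plain integer, on the assumption that the caller obtains it with MPI_Comm_rank.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of Open MPI): run the checkpoint
 * command on rank 0 only.
 *
 * Returns  0 if this rank skipped the request,
 *          1 if the command ran and exited with status 0,
 *         -1 if the command could not be run or reported failure.
 */
int request_checkpoint_once(int rank, const char *cmd)
{
    assert(cmd != NULL);

    if (rank != 0) {
        /* Every other rank does nothing, so only one request is sent. */
        return 0;
    }

    /* system() returns 0 when the shell command exits successfully. */
    return (system(cmd) == 0) ? 1 : -1;
}
```

Inside mychkpt() this would be called as request_checkpoint_once(rank, "ompi-checkpoint -v `pidof mpirun`"), with rank filled in by MPI_Comm_rank(MPI_COMM_WORLD, &rank). Whether an MPI_Barrier(MPI_COMM_WORLD) is also needed before the call, so that all ranks have reached the same point in the program, is an open question in this sketch.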
Thank you
Jean