
I'm using the development version of OMPI from SVN (rev. 19857)
to run MPI jobs on my cluster. In particular, I'm using the
checkpoint/restart feature, based on the most recent version
of BLCR.

Checkpointing works fine as long as only a single job is running
on a node. If more than one MPI application is running on a node,
ompi-checkpoint sometimes does not return and hangs forever.

Example: checkpointing with a single running application

I'm using the MPI-enabled flavor of Povray as a demo application.
I start it on a node with the following command:

  mpirun -np 4 mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 \
  +O planet.tga

This gives me 4 MPI processes, all running on the local node.
Checkpointing it with

  ompi-checkpoint -v --term 7022

(where 7022 is the PID of the mpirun process) gives me a checkpoint
dataset, ompi_global_snapshot_7022.ckpt, which can be used for
restarting the job.

The ompi-checkpoint command gives the following output:

[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:07480]     PID 7022
[grid-demo-1.cit.tu-berlin.de:07480]     Connected to Mpirun [[2899,0],0]
[grid-demo-1.cit.tu-berlin.de:07480]     Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7022
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Running - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] File Transfer - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Finished - Global Snapshot Reference: ompi_global_snapshot_7022.ckpt
Snapshot Ref.:   0 ompi_global_snapshot_7022.ckpt

Example: checkpointing with two running applications

Similar to the first example, I'm again using the MPI-enabled flavor
of Povray as a demo application. This time, however, I start a second
Povray computation in parallel. This gives me 8 MPI processes
(4 processes for each MPI job), so that the 8 cores of my system
are fully utilized.

Without checkpointing, these two jobs run without any problem, each
producing a Povray image. However, if I use ompi-checkpoint to
checkpoint one of the two jobs, the command may not return.

Again I'm executing

  ompi-checkpoint -v --term 13572

(where 13572 is the PID of the mpirun process). This command gives
the following output and does not return:

[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:14252]     PID 13572
[grid-demo-1.cit.tu-berlin.de:14252]     Connected to Mpirun [[9529,0],0]
[grid-demo-1.cit.tu-berlin.de:14252]     Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13572
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:14252] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:14252] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:14252] Running - Global Snapshot Reference: (null)
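
Comparing the two outputs, the successful run goes through the stages
Requested, Pending (Termination), Running, File Transfer and Finished,
while the hanging run stops after Running. To compare several runs
quickly, the status lines could be extracted from the verbose output.
The following is only a minimal sketch in Python, assuming the log format
is exactly as shown above; checkpoint_stages and last_stage are
hypothetical helper names of my own, not part of Open MPI.

```python
import re

# Matches status lines of the form seen in the ompi-checkpoint -v output:
#   [host:pid] Running - Global Snapshot Reference: (null)
# Assumption: status names consist only of letters, spaces and parentheses.
STATUS_RE = re.compile(
    r"\]\s+([A-Za-z][A-Za-z ()]*?)\s+-\s+Global Snapshot Reference:\s+(\S+)"
)


def checkpoint_stages(log_text):
    """Return (status, snapshot_reference) pairs in order of appearance."""
    return STATUS_RE.findall(log_text)


def last_stage(log_text):
    """Return the last status reached, or None if no status line was found."""
    stages = checkpoint_stages(log_text)
    return stages[-1][0] if stages else None
```

Applied to the output above, last_stage should report Running for the
hanging call and Finished for the successful one.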

I want to stress that ompi-checkpoint does not hang every
time I run it while more than one job is running, but in
approx. 50% of all cases. I don't see any difference between
the successful and the hanging calls.
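
Since the hang only shows up in some runs, a wrapper with a timeout
might help to tell a hanging call from a slow one and to keep whatever
output was printed up to that point. This is just a rough sketch in
Python, not a fix: run_with_timeout is a hypothetical helper name, and
the 120 second limit in the usage comment is an arbitrary assumption.
Note that the child process is killed when the timeout expires; I have
not checked what that does to the mpirun process being checkpointed.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (returned, output).

    returned is False if the command was still running after timeout_s
    seconds (in that case it is killed); output holds whatever was
    written to stdout/stderr up to that point.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            timeout=timeout_s,
        )
        return True, proc.stdout
    except subprocess.TimeoutExpired as exc:
        out = exc.stdout or ""
        # Depending on the Python version, partial output may be bytes.
        if isinstance(out, bytes):
            out = out.decode(errors="replace")
        return False, out


# Usage with the command from this report (the timeout value is a guess):
# returned, output = run_with_timeout(
#     ["ompi-checkpoint", "-v", "--term", "13572"], 120)
# if not returned:
#     sys.stderr.write("ompi-checkpoint did not return\n" + output)
```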

Is there a way to increase the debug output?

