On Sep 16, 2008, at 11:18 PM, Matthias Hovestadt wrote:
Hi!
Since I am interested in fault tolerance, checkpoint/restart support
in OMPI is an interesting feature for me. So I installed BLCR 0.7.3
as well as OMPI from SVN (rev. 19553). For OMPI I followed the
instructions in the "Fault Tolerance Guide" in the OMPI wiki:
./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s
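As a sanity check, the C/R components should show up in ompi_info
after the build; with BLCR installed I would expect to see an
"MCA crs: blcr" line (the exact component list may differ by revision):
ccs@grid-demo-1:~$ ompi_info | grep crs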
This gave me an OMPI version with checkpointing support, so I
started testing. The good news is: I am able to checkpoint and
restart applications. The bad news is: checkpointing a restarted
application fails.
In detail:
1) Starting the application
ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml
This starts my MPI-enabled application without any problems.
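In case anyone wants to reproduce this without yafaray, any
long-running MPI program should do; a minimal sketch along these
lines (not the original application):

/* cr_test.c - long-running MPI loop for checkpoint/restart testing */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 600; i++) {      /* run long enough to checkpoint */
        if (rank == 0) {
            printf("iteration %d\n", i);
            fflush(stdout);          /* make progress visible */
        }
        sleep(1);                    /* simulate work */
        MPI_Barrier(MPI_COMM_WORLD); /* keep the ranks communicating */
    }
    MPI_Finalize();
    return 0;
}

Built and run the same way:
ccs@grid-demo-1:~$ mpicc cr_test.c -o cr_test
ccs@grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr cr_test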
2) Checkpointing the application
First I queried the PID of the mpirun process:
ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs 13897 0.4 0.2 63992 2704 pts/0 S+ 04:59 0:00 mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml
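(pgrep would give the same PID without the grep noise:
ccs@grid-demo-1:~$ pgrep mpirun
13897)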
Then I checkpointed the job, telling it to terminate right after the
checkpoint:
ccs@grid-demo-1:~$ ompi-checkpoint --term 13897
Snapshot Ref.: 0 ompi_global_snapshot_13897.ckpt
ccs@grid-demo-1:~$
The application indeed terminated:
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13898 on node grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
2 total processes killed (some possibly by mpirun during cleanup)
The checkpoint command generated a 367 MB checkpoint dataset:
ccs@grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
367M ompi_global_snapshot_13897.ckpt/
ccs@grid-demo-1:~$
3) Restarting the application
For restarting the application, I first executed ompi-clean, then
restarted the job, preloading all files:
ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ ompi-restart --preload ompi_global_snapshot_13897.ckpt/
Restarting works fine. The job restarts from the checkpointed state
and continues to execute. If not interrupted, it runs to completion
and returns a correct result.
However, I observed one weird thing: restarting the application seems
to have changed the checkpoint dataset. Moreover, two new directories
were created at restart time:
4 drwx------ 3 ccs ccs 4096 Sep 17 05:09 ompi_global_snapshot_13897.ckpt
4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_0.ckpt
4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_1.ckpt
The opal_snapshot_*.ckpt directories are an artifact of the --preload
option. This option copies the individual checkpoints to the remote
machines before execution.
4) Checkpointing again
Again I first looked for the PID of the running mpirun process:
ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs 14005 0.0 0.2 63992 2736 pts/1 S+ 05:09 0:00 mpirun -am ft-enable-cr --app /home/ccs/ompi_global_snapshot_13897.ckpt/restart-appfile
Then I checkpointed it:
ccs@grid-demo-1:~$ ompi-checkpoint 14005
When executing this checkpoint command, the running application
immediately aborts, even though I did not specify the "--term" option:
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 14050 on node grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
--------------------------------------------------------------------------
ccs@grid-demo-1:~$
Interesting. This looks like a bug with the restart mechanism in Open
MPI. This was working fine, but something must have changed in the
trunk to break it.
A useful piece of debugging information for me would be a stack trace
from the failed process. You should be able to get this from a core
file, if it left one, or by setting the following MCA variable in
$HOME/.openmpi/mca-params.conf:
opal_cr_debug_sigpipe=1
This will cause the Open MPI app to wait in a sleep loop when it
detects a Broken Pipe signal. Then you should be able to attach a
debugger and retrieve a stack trace.
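Something along these lines should work, using PID 14050 from your
output as an example:
ccs@grid-demo-1:~$ mkdir -p $HOME/.openmpi
ccs@grid-demo-1:~$ echo "opal_cr_debug_sigpipe=1" >> $HOME/.openmpi/mca-params.conf
(reproduce the failure, then attach to the process stuck in the sleep loop)
ccs@grid-demo-1:~$ gdb -p 14050
(gdb) bt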
The "ompi-checkpoint 14005" command however does not return.
Is anybody here using checkpoint/restart capabilities of OMPI?
Did anybody encounter similar problems? Or is there something wrong
with the way I am using ompi-checkpoint/ompi-restart?
I work with the checkpoint/restart functionality on a daily basis,
but I must admit that I haven't worked on the trunk in a few weeks.
I'll take a look and let you know what I find. I suspect that Open
MPI is not resetting properly after a checkpoint.
Any hint is greatly appreciated! :-)
Best,
Matthias