[OMPI users] Checkpointing fails with BLCR 0.8.0b2
Hi!

Berkeley recently released a new version of BLCR. The function cr_request_file was already marked as deprecated in BLCR 0.7.3, and the deprecated functions have now been removed from the libcr API. Since Open MPI's checkpointing support uses cr_request_file, all checkpointing operations fail with BLCR 0.8.0b2, making a downgrade to BLCR 0.7.3 necessary.

Best,
Matthias
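Whether a given libcr build still exports cr_request_file can be checked directly from the shell before rebuilding anything; a minimal sketch (the library path is an assumption and depends on the BLCR installation prefix):

---
# Check whether the installed libcr still exports cr_request_file
# (/usr/local/lib/libcr.so is an assumed path; adjust to your BLCR prefix)
nm -D /usr/local/lib/libcr.so | grep cr_request_file || \
    echo "cr_request_file not exported -- this libcr is 0.8.0 or newer"
---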
Re: [OMPI users] ompi-checkpoint is hanging
Hi Tim!

First of all: thanks a lot for answering! :-)

> Could you try running your two MPI jobs with fewer procs each, say 2
> or 3 each instead of 4, so that there are a few extra cores available?

This problem occurs with any number of procs.

> Also, what happens to the checkpointing of one MPI job if you kill
> the other MPI job after the first "hangs"?

Nothing, it keeps hanging.

> (It may not be a true hang, but very very slow progress that you
> are observing.)

I already waited for more than 12 hours, but the ompi-checkpoint did not return. So if it's slow, it must be very slow.

I continued testing and just observed a case where the problem occurred with only one job running on the compute node:

---
ccs@grid-demo-1:~$ ps auxww | grep mpirun | grep -v grep
ccs 7706 0.4 0.2 63864 2640 ? S 15:35 0:00 mpirun -np 1 -am ft-enable-cr -np 6 /home/ccs/XN-OMPI/testdrive/loop-1/remotedir/mpi-x-povray +I planet.pov -w1600 -h1200 +SP1 +O planet.tga
ccs@grid-demo-1:~$
---

The resource management system tried to checkpoint this job using the command "ompi-checkpoint -v --term 7706". This is the output of that command:

---
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08178] PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08178] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Running - Global Snapshot Reference: (null)
---

If I look at the activity on the node, I see that the processes are still computing:

---
 PID  USER PR NI VIRT RES  SHR  S %CPU %MEM  TIME+   COMMAND
 7710 ccs  25  0 327m 6936 4052 R  102  0.7  4:14.17 mpi-x-povray
 7712 ccs  25  0 327m 6884 4000 R  102  0.7  3:34.06 mpi-x-povray
 7708 ccs  25  0 327m 6896 4012 R   66  0.7  2:42.10 mpi-x-povray
 7707 ccs  25  0 331m 10m  3736 R   54  1.0  3:08.62 mpi-x-povray
 7709 ccs  25  0 327m 6940 4056 R   48  0.7  1:48.24 mpi-x-povray
 7711 ccs  25  0 327m 6724 4032 R   36  0.7  1:29.34 mpi-x-povray
---

Then I killed the hanging ompi-checkpoint operation and tried to execute a checkpoint manually:

---
ccs@grid-demo-1:~$ ompi-checkpoint -v --term 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08224] PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08224] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
---

Is there perhaps a way of increasing the level of debug output? Please let me know if I can support you in any way...

Best,
Matthias
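One way to get more debug output from the checkpoint/restart path is to raise the verbosity of the MCA frameworks involved. A sketch, assuming the usual <framework>_base_verbose naming convention of Open MPI's MCA parameters (the exact parameter names and supported levels are not verified against this SVN revision):

---
# Raise verbosity of the C/R-related MCA frameworks for a single run
# (crs/snapc/filem parameter names follow the generic
#  <framework>_base_verbose convention and are an assumption here)
mpirun -np 4 -am ft-enable-cr \
    -mca crs_base_verbose 10 \
    -mca snapc_base_verbose 10 \
    -mca filem_base_verbose 10 \
    mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 +O planet.tga
---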
[OMPI users] ompi-checkpoint is hanging
Hi!

I'm using the development version of OMPI from SVN (rev. 19857) for executing MPI jobs on my cluster system. In particular, I'm using the checkpoint and restart feature, based on the latest version of BLCR. Checkpointing works fine as long as I only execute a single job on a node. If more than one MPI application is executing on a system, ompi-checkpoint sometimes does not return, hanging forever.

Example: checkpointing with a single running application

I'm using the MPI-enabled flavor of Povray as demo application, starting it on a node with the following command:

mpirun -np 4 mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 \
    +O planet.tga

This gives me 4 MPI processes, all running on the local node. Checkpointing with "ompi-checkpoint -v --term 7022" (where 7022 is the PID of the mpirun process) gives me a checkpoint dataset ompi_global_snapshot_7022.ckpt that can be used for restarting the job. The ompi-checkpoint command gives the following output:

---
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:07480] PID 7022
[grid-demo-1.cit.tu-berlin.de:07480] Connected to Mpirun [[2899,0],0]
[grid-demo-1.cit.tu-berlin.de:07480] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Contact Head Node Process PID 7022
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Pending (Termination) - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Running - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] File Transfer - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Finished - Global Snapshot Reference: ompi_global_snapshot_7022.ckpt
Snapshot Ref.: 0 ompi_global_snapshot_7022.ckpt
---

Example: checkpointing with two running applications

Similar to the first example, I'm again using the MPI-enabled flavor of Povray as demo application. But now I'm starting not just one Povray computation but a second one in parallel. This gives me 8 MPI processes (4 processes per MPI job), so that the 8 cores of my system are fully utilized. Without checkpointing, these two jobs execute without any problem, each resulting in a Povray image. However, if I use ompi-checkpoint to checkpoint one of these two jobs, ompi-checkpoint may never return.

Again I'm executing "ompi-checkpoint -v --term 13572" (where 13572 is the PID of the mpirun process). This command gives the following output, never returning to the user:

---
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:14252] PID 13572
[grid-demo-1.cit.tu-berlin.de:14252] Connected to Mpirun [[9529,0],0]
[grid-demo-1.cit.tu-berlin.de:14252] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: Contact Head Node Process PID 13572
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:14252] Requested - Global Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Receive a command message.
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: Status Update.
[grid-demo-1.cit.tu-berlin.de:14252] Pending (Termination) - Global Snapshot Reference: (null)
---
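For reference, the two-job scenario can be reproduced with a short script along these lines; a sketch, assuming the same Povray binary and scene file as above (the distinct output file names are illustrative, not taken from the original report):

---
# Reproduce the two-job scenario: start two 4-proc jobs on the same
# 8-core node, then checkpoint one of them
mpirun -np 4 -am ft-enable-cr mpi-x-povray +I planet.pov \
    -w1200 -h1000 +SP1 +O planet-a.tga &
PID_A=$!
mpirun -np 4 -am ft-enable-cr mpi-x-povray +I planet.pov \
    -w1200 -h1000 +SP1 +O planet-b.tga &

# With two jobs running, this is the call that sometimes never returns
ompi-checkpoint -v --term $PID_A
---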
Re: [OMPI users] Checkpointing a restarted app fails
Hi Josh!

> I believe this is now fixed in the trunk. I was able to reproduce
> with the current trunk and committed a fix a few minutes ago in
> r19601. So the fix should be in tonight's tarball (or you can grab
> it from SVN). I've made a request to have the patch applied to v1.3,
> but that may take a day or so to complete.

I updated to r19607 and this really worked out. I'm now able to checkpoint restarted applications without any problems. Yippee!

> Thanks for the bug report :)

Thanks for fixing it :-)

Best,
Matthias
Re: [OMPI users] Checkpointing a restarted app fails
Hi Josh!

First of all, thanks a lot for replying. :-)

>> When executing this checkpoint command, the running application
>> directly aborts, even though I did not specify the "--term" option:
>>
>> --
>> mpirun noticed that process rank 1 with PID 14050 on node
>> grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
>> --
>> ccs@grid-demo-1:~$
>
> Interesting. This looks like a bug with the restart mechanism in
> Open MPI. This was working fine, but something must have changed in
> the trunk to break it.

Do you perhaps know an SVN revision number of OMPI that is known to be working? If this issue is a regression, I would be glad to use the source from an old but working SVN state...

> A useful piece of debugging information for me would be a stack
> trace from the failed process. You should be able to get this from a
> core file it left, or if you set the following MCA variable in
> $HOME/.openmpi/mca-params.conf:
>     opal_cr_debug_sigpipe=1
> This will cause the Open MPI app to wait in a sleep loop when it
> detects a Broken Pipe signal. Then you should be able to attach a
> debugger and retrieve a stack trace.

I created this file:

---
ccs@grid-demo-1:~$ cat .openmpi/mca-params.conf
opal_cr_debug_sigpipe=1
ccs@grid-demo-1:~$
---

Then I restarted the application from a checkpointed state and tried to checkpoint this restarted application. Unfortunately the restarted application still terminates, despite this parameter. However, the output changed slightly:

---
worker fetch area available 1
[grid-demo-1.cit.tu-berlin.de:26220] opal_cr: sigpipe_debug: Debug SIGPIPE [13]: PID (26220)
--
mpirun noticed that process rank 0 with PID 26248 on node grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
--
2 total processes killed (some possibly by mpirun during cleanup)
ccs@grid-demo-1:~$
---

There is now this additional "opal_cr: sigpipe_debug" line, so it apparently does evaluate .openmpi/mca-params.conf.

I also tried to get a corefile by setting "ulimit -c 5"; ulimit -a gives me the following output:

---
ccs@grid-demo-1:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) unlimited
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) unlimited
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
ccs@grid-demo-1:~$
---

Unfortunately, no corefile is generated, so I do not know how to give you the requested stack trace. Are there perhaps other debug parameters I could use?

Best,
Matthias
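Since opal_cr_debug_sigpipe=1 leaves the failing process waiting in a sleep loop, the stack trace can also be taken by attaching a debugger to the PID printed in the sigpipe_debug line, without going through a core file at all. A sketch (the PID is the one from the output above; note also that core dumps require a non-zero core file size in the shell that launches mpirun, and the "ulimit -c 5" above means 5 blocks, far too small for a usable core):

---
# Option 1: attach to the process waiting in the SIGPIPE debug loop
# (26220 is the PID printed in the sigpipe_debug line above)
gdb -p 26220
# then at the (gdb) prompt: bt

# Option 2: allow real core dumps; must be set in the shell that
# starts mpirun, before the run
ulimit -c unlimited
---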
Re: [OMPI users] Where is ompi-checkpoint?
Hi!

> Hi, I have installed openmpi-1.2.7 with following instructions:
>
> ./configure --with-ft=cr --enable-ft-enable-thread \
>     --enable-mpi-thread --with-blcr=$HOME/blcr --prefix=$HOME/openmpi
> make all install
>
> In directory bin of directory $HOME/openmpi there is not
> ompi-checkpoint and ompi-restart.

As far as I know, checkpointing support is not available in OMPI 1.2.7. You have to use the devel version (1.3) of OMPI, e.g. by checking out the source from SVN.

Best,
Matthias
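Note also that the flag spellings in the configure line above differ from the ones in the "Fault Tolerance Guide" in the OMPI wiki (quoted later in this thread). A sketch of a 1.3-branch build following that guide (the BLCR and install prefixes are carried over from the question above):

---
# Build a C/R-enabled Open MPI from the 1.3 development source, using
# the wiki's flag spellings: --enable-ft-thread / --enable-mpi-threads
# (not --enable-ft-enable-thread / --enable-mpi-thread)
./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
    --with-blcr=$HOME/blcr --prefix=$HOME/openmpi
make all install
---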
[OMPI users] Checkpointing a restarted app fails
Hi!

Since I am interested in fault tolerance, checkpointing and restart of OMPI is an interesting feature for me. So I installed BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI I followed the instructions in the "Fault Tolerance Guide" in the OMPI wiki:

./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s

This gave me an OMPI version with checkpointing support, so I started testing. The good news: I am able to checkpoint and restart applications. The bad news: checkpointing a restarted application fails. In detail:

1) Starting the application

---
ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml
---

This starts my MPI-enabled application without any problems.

2) Checkpointing the application

First I queried the PID of the mpirun process:

---
ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs 13897 0.4 0.2 63992 2704 pts/0 S+ 04:59 0:00 mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml
---

Then I checkpointed the job, terminating it directly:

---
ccs@grid-demo-1:~$ ompi-checkpoint --term 13897
Snapshot Ref.: 0 ompi_global_snapshot_13897.ckpt
ccs@grid-demo-1:~$
---

The application indeed terminated:

--
mpirun noticed that process rank 0 with PID 13898 on node grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
--
2 total processes killed (some possibly by mpirun during cleanup)

The checkpoint command generated a checkpoint dataset of 367MB size:

---
ccs@grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
367M    ompi_global_snapshot_13897.ckpt/
ccs@grid-demo-1:~$
---

3) Restarting the application

For restarting the application, I first executed ompi-clean, then restarted the job, preloading all files:

---
ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ ompi-restart --preload ompi_global_snapshot_13897.ckpt/
---

Restarting works fine. The job restarts from the checkpointed state and continues to execute. If not interrupted, it runs until its end, returning a correct result. However, I observed one weird thing: restarting the application seems to have changed the checkpoint dataset. Moreover, two new directories have been created at restart time:

---
4 drwx------ 3 ccs ccs 4096 Sep 17 05:09 ompi_global_snapshot_13897.ckpt
4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_0.ckpt
4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_1.ckpt
---

4) Checkpointing again

Again I first looked for the PID of the running mpirun process:

---
ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs 14005 0.0 0.2 63992 2736 pts/1 S+ 05:09 0:00 mpirun -am ft-enable-cr --app /home/ccs/ompi_global_snapshot_13897.ckpt/restart-appfile
---

Then I checkpointed it:

---
ccs@grid-demo-1:~$ ompi-checkpoint 14005
---

When executing this checkpoint command, the running application directly aborts, even though I did not specify the "--term" option:

--
mpirun noticed that process rank 1 with PID 14050 on node grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
--
ccs@grid-demo-1:~$

The "ompi-checkpoint 14005" command, however, does not return.

Is anybody here using the checkpoint/restart capabilities of OMPI? Did anybody encounter similar problems? Or is there something wrong with my way of using ompi-checkpoint/ompi-restart? Any hint is greatly appreciated! :-)

Best,
Matthias
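The PID lookup in steps 2 and 4 can be scripted so that the checkpoint step does not depend on reading ps output by hand; a small sketch, assuming exactly one mpirun instance is running for this user on the node:

---
# Script the checkpoint step: find the mpirun PID and checkpoint it
# (-x matches the exact process name; assumes a single mpirun per user)
PID=$(pgrep -u "$USER" -x mpirun)
ompi-checkpoint --term "$PID"
---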
Re: [OMPI users] Checkpoint problem
Hi Gabriele!

> In this case, mpirun works well, but the checkpoint procedure fails:
>
> ompi-checkpoint 20109
> [node0316:20134] Error: Unable to get the current working directory
> [node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file
> orte-checkpoint.c at line 395
> [node0316:20134] HNP with PID 20109 Not found!

I had exactly the same problem on my machine. Neither modifying the configure parameters nor changing the way of invoking the ompi-checkpoint command helped. Since I am using the source from a subversion checkout, I also updated the source several times, following the day-to-day progress. However, this problem remained. Luckily, updating the source to SVN revision 19265 finally solved this checkpointing issue. Maybe the problem shows up again in later versions...

Best,
Matthias
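Pinning a working copy to that revision is straightforward; a sketch, assuming an existing trunk checkout and the configure flags used earlier in this thread (the rebuild steps are carried over from the Fault Tolerance Guide instructions quoted above):

---
# In an existing Open MPI trunk checkout: move to the revision that
# worked here, then rebuild
svn update -r 19265
./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s
---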