Re: [OMPI users] Checkpointing a restarted app fails
Hi Josh!

> I believe this is now fixed in the trunk. I was able to reproduce with
> the current trunk and committed a fix a few minutes ago in r19601. So
> the fix should be in tonight's tarball (or you can grab it from SVN).
> I've made a request to have the patch applied to v1.3, but that may
> take a day or so to complete.

I updated to r19607, and this indeed solved the problem: I'm now able to checkpoint restarted applications without any problems. Yippee!

> Thanks for the bug report :)

Thanks for fixing it :-)

Best,
Matthias
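P.S. For anyone finding this thread later, this is the complete cycle that used to fail and now works for me on the trunk (a sketch of the commands from earlier in this thread; PIDs and the snapshot name will of course differ on your system):

  # 1. Start the job with the checkpoint/restart AMCA parameter set
  $ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml

  # 2. Checkpoint and terminate it (13897 is the PID of mpirun)
  $ ompi-checkpoint --term 13897

  # 3. Restart from the global snapshot
  $ ompi-restart ompi_global_snapshot_13897.ckpt/

  # 4. Checkpoint the restarted job -- the step that used to die
  #    with SIGPIPE (14005 is the PID of the new mpirun)
  $ ompi-checkpoint 14005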
Re: [OMPI users] Checkpointing a restarted app fails
I believe this is now fixed in the trunk. I was able to reproduce the problem with the current trunk and committed a fix a few minutes ago in r19601. So the fix should be in tonight's tarball (or you can grab it from SVN). I've made a request to have the patch applied to v1.3, but that may take a day or so to complete.

Let me know if this fix eliminates your SIGPIPE issues.

Thanks for the bug report :)

Cheers,
Josh

On Sep 17, 2008, at 11:55 PM, Matthias Hovestadt wrote:

> Hi Josh!
>
> First of all, thanks a lot for replying. :-)
>
>>> When executing this checkpoint command, the running application
>>> directly aborts, even though I did not specify the "--term" option:
>>>
>>>   --------------------------------------------------------------------------
>>>   mpirun noticed that process rank 1 with PID 14050 on node
>>>   grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
>>>   --------------------------------------------------------------------------
>>>   ccs@grid-demo-1:~$
>>
>> Interesting. This looks like a bug with the restart mechanism in
>> Open MPI. This was working fine, but something must have changed in
>> the trunk to break it.
>
> Do you perhaps know an SVN revision number of OMPI that is known to
> be working? If this issue is a regression, I would be glad to use
> the source from an old but working SVN state...
>
>> A useful piece of debugging information for me would be a stack
>> trace from the failed process. You should be able to get this from
>> a core file, if one was left. Alternatively, set the following MCA
>> variable in $HOME/.openmpi/mca-params.conf:
>>
>>   opal_cr_debug_sigpipe=1
>>
>> This will cause the Open MPI app to wait in a sleep loop when it
>> detects a Broken Pipe signal. Then you should be able to attach a
>> debugger and retrieve a stack trace.
>
> I created this file:
>
>   ccs@grid-demo-1:~$ cat .openmpi/mca-params.conf
>   opal_cr_debug_sigpipe=1
>   ccs@grid-demo-1:~$
>
> Then I restarted the application from a checkpointed state and tried
> to checkpoint this restarted application. Unfortunately, the
> restarted application still terminates, despite this parameter.
> However, the output slightly changed:
>
>   worker fetch area available 1
>   [grid-demo-1.cit.tu-berlin.de:26220] opal_cr: sigpipe_debug: Debug SIGPIPE [13]: PID (26220)
>   --------------------------------------------------------------------------
>   mpirun noticed that process rank 0 with PID 26248 on node
>   grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
>   --------------------------------------------------------------------------
>   2 total processes killed (some possibly by mpirun during cleanup)
>   ccs@grid-demo-1:~$
>
> There is now this additional "opal_cr: sigpipe_debug" line, so it
> apparently evaluates .openmpi/mca-params.conf.
>
> I also tried to get a corefile by setting "ulimit -c 5", so that
> ulimit -a gives me the following output:
>
>   ccs@grid-demo-1:~$ ulimit -a
>   core file size          (blocks, -c) 0
>   data seg size           (kbytes, -d) unlimited
>   scheduling priority             (-e) 20
>   file size               (blocks, -f) unlimited
>   pending signals                 (-i) unlimited
>   max locked memory       (kbytes, -l) unlimited
>   max memory size         (kbytes, -m) unlimited
>   open files                      (-n) 1024
>   pipe size            (512 bytes, -p) 8
>   POSIX message queues     (bytes, -q) unlimited
>   real-time priority              (-r) 0
>   stack size              (kbytes, -s) 8192
>   cpu time               (seconds, -t) unlimited
>   max user processes              (-u) unlimited
>   virtual memory          (kbytes, -v) unlimited
>   file locks                      (-x) unlimited
>   ccs@grid-demo-1:~$
>
> Unfortunately, no corefile is generated, so I do not know how to give
> you the requested stack trace. Are there perhaps other debug
> parameters I could use?
>
> Best,
> Matthias
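P.S. If the app is sitting in the opal_cr_debug_sigpipe sleep loop, you should be able to grab the trace by attaching gdb to the PID printed in the sigpipe_debug line (26220 in your output above). Something along these lines, assuming gdb is installed on the node:

  # Attach to the process that caught SIGPIPE
  $ gdb -p 26220
  (gdb) bt                    # backtrace of the current thread
  (gdb) thread apply all bt   # backtraces of every thread
  (gdb) detach
  (gdb) quit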
Re: [OMPI users] Checkpointing a restarted app fails
Hi Josh!

First of all, thanks a lot for replying. :-)

>> When executing this checkpoint command, the running application
>> directly aborts, even though I did not specify the "--term" option:
>>
>>   --------------------------------------------------------------------------
>>   mpirun noticed that process rank 1 with PID 14050 on node
>>   grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
>>   --------------------------------------------------------------------------
>>   ccs@grid-demo-1:~$
>
> Interesting. This looks like a bug with the restart mechanism in Open
> MPI. This was working fine, but something must have changed in the
> trunk to break it.

Do you perhaps know an SVN revision number of OMPI that is known to be working? If this issue is a regression, I would be glad to use the source from an old but working SVN state...

> A useful piece of debugging information for me would be a stack trace
> from the failed process. You should be able to get this from a core
> file, if one was left. Alternatively, set the following MCA variable
> in $HOME/.openmpi/mca-params.conf:
>
>   opal_cr_debug_sigpipe=1
>
> This will cause the Open MPI app to wait in a sleep loop when it
> detects a Broken Pipe signal. Then you should be able to attach a
> debugger and retrieve a stack trace.

I created this file:

  ccs@grid-demo-1:~$ cat .openmpi/mca-params.conf
  opal_cr_debug_sigpipe=1
  ccs@grid-demo-1:~$

Then I restarted the application from a checkpointed state and tried to checkpoint this restarted application. Unfortunately, the restarted application still terminates, despite this parameter. However, the output slightly changed:

  worker fetch area available 1
  [grid-demo-1.cit.tu-berlin.de:26220] opal_cr: sigpipe_debug: Debug SIGPIPE [13]: PID (26220)
  --------------------------------------------------------------------------
  mpirun noticed that process rank 0 with PID 26248 on node
  grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
  --------------------------------------------------------------------------
  2 total processes killed (some possibly by mpirun during cleanup)
  ccs@grid-demo-1:~$

There is now this additional "opal_cr: sigpipe_debug" line, so it apparently evaluates .openmpi/mca-params.conf.

I also tried to get a corefile by setting "ulimit -c 5", so that ulimit -a gives me the following output:

  ccs@grid-demo-1:~$ ulimit -a
  core file size          (blocks, -c) 0
  data seg size           (kbytes, -d) unlimited
  scheduling priority             (-e) 20
  file size               (blocks, -f) unlimited
  pending signals                 (-i) unlimited
  max locked memory       (kbytes, -l) unlimited
  max memory size         (kbytes, -m) unlimited
  open files                      (-n) 1024
  pipe size            (512 bytes, -p) 8
  POSIX message queues     (bytes, -q) unlimited
  real-time priority              (-r) 0
  stack size              (kbytes, -s) 8192
  cpu time               (seconds, -t) unlimited
  max user processes              (-u) unlimited
  virtual memory          (kbytes, -v) unlimited
  file locks                      (-x) unlimited
  ccs@grid-demo-1:~$

Unfortunately, no corefile is generated, so I do not know how to give you the requested stack trace. Are there perhaps other debug parameters I could use?

Best,
Matthias
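P.S. Looking at it again, "ulimit -c 5" was probably the problem: it allows a core of only five 512-byte blocks, and the "ulimit -a" output above still reports 0, so the raised limit apparently never applied to the shell that launched mpirun. I will retry along these lines (assuming the kernel's core_pattern writes the core to the working directory):

  # Raise the core limit in the very shell that starts the job
  $ ulimit -c unlimited
  $ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml
  # ... reproduce the SIGPIPE abort, then inspect the core
  $ gdb yafaray-xml core
  (gdb) bt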
Re: [OMPI users] Checkpointing a restarted app fails
On Sep 16, 2008, at 11:18 PM, Matthias Hovestadt wrote:

> Hi!
>
> Since I am interested in fault tolerance, checkpointing and restart
> of OMPI is an interesting feature for me. So I installed BLCR 0.7.3
> as well as OMPI from SVN (rev. 19553). For OMPI I followed the
> instructions in the "Fault Tolerance Guide" in the OMPI wiki:
>
>   ./autogen.sh
>   ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
>   make -s
>
> This gave me an OMPI version with checkpointing support, so I started
> testing. The good news is: I am able to checkpoint and restart
> applications. The bad news is: checkpointing a restarted application
> fails. In detail:
>
> 1) Starting the application
>
>   ccs@grid-demo-1:~$ ompi-clean
>   ccs@grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml
>
> This starts my MPI-enabled application without any problems.
>
> 2) Checkpointing the application
>
> First I queried the PID of the mpirun process:
>
>   ccs@grid-demo-1:~$ ps auxww | grep mpirun
>   ccs 13897 0.4 0.2 63992 2704 pts/0 S+ 04:59 0:00 mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml
>
> Then I checkpointed the job, terminating it directly:
>
>   ccs@grid-demo-1:~$ ompi-checkpoint --term 13897
>   Snapshot Ref.: 0 ompi_global_snapshot_13897.ckpt
>   ccs@grid-demo-1:~$
>
> The application indeed terminated:
>
>   --------------------------------------------------------------------------
>   mpirun noticed that process rank 0 with PID 13898 on node
>   grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
>   --------------------------------------------------------------------------
>   2 total processes killed (some possibly by mpirun during cleanup)
>
> The checkpoint command generated a checkpoint dataset of 367MB size:
>
>   ccs@grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
>   367M    ompi_global_snapshot_13897.ckpt/
>   ccs@grid-demo-1:~$
>
> 3) Restarting the application
>
> For restarting the application, I first executed ompi-clean, then
> restarted the job, preloading all files:
>
>   ccs@grid-demo-1:~$ ompi-clean
>   ccs@grid-demo-1:~$ ompi-restart --preload ompi_global_snapshot_13897.ckpt/
>
> Restarting works fine. The job restarts from the checkpointed state
> and continues to execute. If not interrupted, it runs until its end,
> returning a correct result. However, I observed one weird thing:
> restarting the application seems to have changed the checkpoint
> dataset. Moreover, two new directories have been created at restart
> time:
>
>   4 drwx------ 3 ccs ccs 4096 Sep 17 05:09 ompi_global_snapshot_13897.ckpt
>   4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_0.ckpt
>   4 drwx------ 2 ccs ccs 4096 Sep 17 05:09 opal_snapshot_1.ckpt

The ('opal_snapshot_*.ckpt') directories are an artifact of the --preload option. This option will copy the individual checkpoints to the remote machine before executing.

> 4) Checkpointing again
>
> Again I first looked for the PID of the running mpirun process:
>
>   ccs@grid-demo-1:~$ ps auxww | grep mpirun
>   ccs 14005 0.0 0.2 63992 2736 pts/1 S+ 05:09 0:00 mpirun -am ft-enable-cr --app /home/ccs/ompi_global_snapshot_13897.ckpt/restart-appfile
>
> Then I checkpointed it:
>
>   ccs@grid-demo-1:~$ ompi-checkpoint 14005
>
> When executing this checkpoint command, the running application
> directly aborts, even though I did not specify the "--term" option:
>
>   --------------------------------------------------------------------------
>   mpirun noticed that process rank 1 with PID 14050 on node
>   grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
>   --------------------------------------------------------------------------
>   ccs@grid-demo-1:~$

Interesting. This looks like a bug with the restart mechanism in Open MPI. This was working fine, but something must have changed in the trunk to break it. A useful piece of debugging information for me would be a stack trace from the failed process.
You should be able to get this from a core file, if one was left. Alternatively, set the following MCA variable in $HOME/.openmpi/mca-params.conf:

  opal_cr_debug_sigpipe=1

This will cause the Open MPI app to wait in a sleep loop when it detects a Broken Pipe signal. Then you should be able to attach a debugger and retrieve a stack trace.

> The "ompi-checkpoint 14005" command however does not return.
>
> Is anybody here using checkpoint/restart capabilities of OMPI? Did
> anybody encounter similar problems? Or is there something wrong with
> my way of using ompi-checkpoint/ompi-restart?

I work with the checkpoint/restart functionality on a daily basis, but I must admit that I haven't worked on the trunk in a few weeks. I'll take a look and let you know what I find. I suspect that Open MPI is not resetting properly after a checkpoint.

> Any hint is greatly appreciated! :-)
>
> Best,
> Matthias
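P.S. For the archives: the quickest way to set that variable, plus a sanity check that checkpoint/restart support made it into a build at all. This is a sketch; the exact ompi_info wording may vary between revisions:

  # Enable the SIGPIPE debug hold for all subsequent runs
  $ mkdir -p ~/.openmpi
  $ echo "opal_cr_debug_sigpipe=1" >> ~/.openmpi/mca-params.conf

  # Check that C/R support was compiled in: look for an
  # "FT Checkpoint support" line and a "crs" component (e.g. blcr)
  $ ompi_info | grep -i checkpoint
  $ ompi_info | grep -i crs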