Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
Hi all,

I had the same problem as Jitsumoto: Open MPI 1.4.2 failed to restart, and the patch that Fernando gave did not work. I also tried the 1.5 nightly snapshots, but they did not seem to work well either. For my own reasons I would prefer not to use --enable-ft-thread in configure, but the same error occurs even when --enable-ft-thread is used.

Here is my configure command for OMPI 1.5a1r23135:

>./configure \
>--with-ft=cr \
>--enable-mpi-threads \
>--with-blcr=/home/nguyen/opt/blcr --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
>--prefix=/home/nguyen/opt/openmpi_1.5 --enable-mpirun-prefix-by-default \

and the errors:

>$ mpirun -am ft-enable-cr -machinefile ./host ./a.out
>0
>0
>1
>1
>2
>2
>3
>3
>--
>mpirun has exited due to process rank 1 with PID 6582 on
>node rc014 exiting improperly. There are two reasons this could occur:
>1. this process did not call "init" before exiting, but others in
>the job did. This can cause a job to hang indefinitely while it waits
>for all processes to call "init". By rule, if one process calls "init",
>then ALL processes must call "init" prior to termination.
>2. this process called "init", but exited without calling "finalize".
>By rule, all processes that call "init" MUST call "finalize" prior to
>exiting or it will be considered an "abnormal termination"
>This may have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>---

And here is the checkpoint command:

>$ ompi-checkpoint -s -v --term 10982
>[rc013.local:11001] [ 0.00 / 0.14] Requested - ...
>[rc013.local:11001] [ 0.00 / 0.14] Pending - ...
>[rc013.local:11001] [ 0.01 / 0.15] Running - ...
>[rc013.local:11001] [ 7.79 / 7.94] Finished - ompi_global_snapshot_10982.ckpt
>Snapshot Ref.: 0 ompi_global_snapshot_10982.ckpt

I also looked inside the checkpoint directory and found that the snapshot had been taken:

~/tmp/ckpt/ompi_global_snapshot_10982.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6582

But restarting failed as follows:

>$ ompi-restart ompi_global_snapshot_10982.ckpt
>--
>mpirun noticed that process rank 1 with PID 11346 on node rc013.local exited
>on signal 11 (Segmentation fault).
>--

Does anyone have an idea what is going wrong?
Thank you!

Regards,
Nguyen Toan

On Mon, May 24, 2010 at 4:08 PM, Hideyuki Jitsumoto <jitum...@gsic.titech.ac.jp> wrote:

> -- Forwarded message --
> From: Fernando Lemos <fernando...@gmail.com>
> Date: Thu, Apr 15, 2010 at 2:18 AM
> Subject: Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
> To: Open MPI Users <us...@open-mpi.org>
>
>
> On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto
> <hjitsum...@gmail.com> wrote:
> > Fernando,
> >
> > Thank you for your reply.
> > I tried to patch the file you mentioned, but the output did not change.
>
> I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
> works great.
>
> > >Are you using a shared file system? You need to use a shared file
> > >system for checkpointing with 1.4.1:
> > What is the shared file system ? do you mean NFS, Lustre and so on ?
> > (I'm sorry about my ignorance...)
>
> Something like NFS, yea.
>
> > If I use only one node for application, do I need such a
> > shared-file-system ?
>
> No, for a single node, checkpointing with 1.4.1 should work (it works
> for me, at least). If you're using a single node, then your problem is
> probably not related to the bug report I posted.
>
>
> Regards,
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
> Tokyo Institute of Technology
> Global Scientific Information and Computing center (Matsuoka Lab.)
>
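Editorial note: ompi-restart takes the global snapshot reference that ompi-checkpoint prints on its "Snapshot Ref.:" line. As a rough sketch only, that name could be pulled out of saved ompi-checkpoint output before restarting. The function name snapshot_ref is hypothetical, and the output format is assumed to match the log quoted in the message above.

```shell
#!/bin/sh
# Hypothetical helper: read ompi-checkpoint output on stdin and print the
# global snapshot reference, i.e. the last field of the "Snapshot Ref.:" line.
# Assumes the line format shown in this thread; prints nothing if it is absent.
snapshot_ref() {
    awk '/Snapshot Ref\.:/ { ref = $NF } END { if (ref != "") print ref }'
}
```

Under those assumptions, the output of "ompi-checkpoint -s -v --term 10982" could be saved to a file such as ckpt.log, and the restart issued as ompi-restart "$(snapshot_ref < ckpt.log)". This only automates the bookkeeping; it does not address the segmentation fault reported on restart.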
Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
Hi Josh,

Thank you for your reply.

I applied the patch from Ticket #2139 to openmpi-1.4.1 and reinstalled everything from scratch. After that, it worked correctly. There was probably some fault in the way I had prepared my environment.

# I cannot reproduce the environment in which the failure occurred.
# I'm very sorry that I cannot find the real cause of this malfunction
# and cannot send any further information.
# I now use openmpi-1.4.2, and it works well without any patch (except for ompi_info).

>> In addition, when I confirmed open_info output as your demo movie, I got
>> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)
>
> This is actually a known bug with ompi_info. I have a fix in the works for
> it, and should be available soon. Until then the ticket is linked below:
> https://svn.open-mpi.org/trac/ompi/ticket/2097

Thank you, I'll try it.

On Wed, May 19, 2010 at 3:46 AM, Josh Hursey wrote:
> (Sorry for the delay in replying, more below)
>
> On Apr 12, 2010, at 6:36 AM, Hideyuki Jitsumoto wrote:
>
>> Hi Members,
>>
>> I tried to use checkpoint/restart by openmpi.
>> But I can not get collect checkpoint data.
>> I prepared execution environment as follows, the strings in () mean
>> name of output file which attached on next e-mail ( for mail size
>> limitation ):
>>
>> 1. installed BLCR and checked BLCR is working correctly by "make check"
>> 2. executed ./configure with some parameters on openMPI source dir
>> (config.output / config.log)
>> 3. executed make and make install (make.output.2 / install.output.2)
>> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
>> /${INSTALL_DIR}/lib/openmpi
>> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
>> 6. compiled NPB and executed with -am ft-enable-cr
>> 7. invoked ompi-checkpoint
>>
>> As result, I got the message "Checkpoint failed: no processes
>> checkpointed."
>> (cr_test_cg)
>
> It is unclear from the output what caused the checkpoint to fail. Can you
> turn on some verbose arguments and send me the output?
>
> Put the following options in you ~/.openmpi/mca-params.conf:
> #---
> orte_debug_daemons=1
> snapc_full_verbose=20
> crs_base_verbose=10
> opal_cr_verbose=10
> #---
>
>
>>
>> In addition, when I confirmed open_info output as your demo movie, I got
>> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)
>
> This is actually a known bug with ompi_info. I have a fix in the works for
> it, and should be available soon. Until then the ticket is linked below:
> https://svn.open-mpi.org/trac/ompi/ticket/2097
>
>>
>> How should I do for checkpointing ?
>> Any guidance in this regard would be highly appreciated.
>
> Let's see what the verbose output tells us, and go from there. What version
> of BLCR are you using?
>
> -- Josh
>
>>
>> Thank you,
>> Hideyuki
>>
>> --
>> Sincerely Yours,
>> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
>> Tokyo Institute of Technology
>> Global Scientific Information and Computing center (Matsuoka Lab.)
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
Sincerely Yours,
Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)
Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
(Sorry for the delay in replying, more below)

On Apr 12, 2010, at 6:36 AM, Hideyuki Jitsumoto wrote:

> Hi Members,
>
> I tried to use checkpoint/restart by openmpi.
> But I can not get collect checkpoint data.
> I prepared execution environment as follows, the strings in () mean
> name of output file which attached on next e-mail ( for mail size
> limitation ):
>
> 1. installed BLCR and checked BLCR is working correctly by "make check"
> 2. executed ./configure with some parameters on openMPI source dir
> (config.output / config.log)
> 3. executed make and make install (make.output.2 / install.output.2)
> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
> /${INSTALL_DIR}/lib/openmpi
> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
> 6. compiled NPB and executed with -am ft-enable-cr
> 7. invoked ompi-checkpoint
>
> As result, I got the message "Checkpoint failed: no processes checkpointed."
> (cr_test_cg)

It is unclear from the output what caused the checkpoint to fail. Can you turn on some verbose arguments and send me the output?

Put the following options in your ~/.openmpi/mca-params.conf:
#---
orte_debug_daemons=1
snapc_full_verbose=20
crs_base_verbose=10
opal_cr_verbose=10
#---

> In addition, when I confirmed open_info output as your demo movie, I got
> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)

This is actually a known bug with ompi_info. I have a fix in the works for it, and it should be available soon. Until then the ticket is linked below:
https://svn.open-mpi.org/trac/ompi/ticket/2097

> How should I do for checkpointing ?
> Any guidance in this regard would be highly appreciated.

Let's see what the verbose output tells us, and go from there. What version of BLCR are you using?

-- Josh

> Thank you,
> Hideyuki
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
> Tokyo Institute of Technology
> Global Scientific Information and Computing center (Matsuoka Lab.)
___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
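Editorial note: the four verbose options suggested above go into ~/.openmpi/mca-params.conf as plain key=value lines. A minimal sketch of adding them without leaving duplicate entries on repeated runs is shown below. The function names set_mca_param and enable_cr_debug are hypothetical; only the parameter names and values come from the message.

```shell
#!/bin/sh
# Hypothetical helper: set "key=value" in an MCA parameter file, replacing any
# earlier assignment of the same key so repeated runs do not add duplicates.
set_mca_param() {
    file=$1 key=$2 value=$3
    mkdir -p "$(dirname "$file")"
    touch "$file"
    # Copy every line except a previous assignment of this key.
    # grep exits non-zero when nothing is left, which is fine here.
    grep -v "^${key}[[:space:]]*=" "$file" > "$file.tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$file.tmp"
    mv "$file.tmp" "$file"
}

# Write the verbose settings suggested in this thread to the given file.
enable_cr_debug() {
    set_mca_param "$1" orte_debug_daemons 1
    set_mca_param "$1" snapc_full_verbose 20
    set_mca_param "$1" crs_base_verbose 10
    set_mca_param "$1" opal_cr_verbose 10
}
```

Calling enable_cr_debug "$HOME/.openmpi/mca-params.conf" would apply the settings to the per-user file. Other lines already in the file are left alone.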
Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto wrote:
> Fernando,
>
> Thank you for your reply.
> I tried to patch the file you mentioned, but the output did not change.

I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it works great.

>> Are you using a shared file system? You need to use a shared file
>> system for checkpointing with 1.4.1:
> What is the shared file system ? do you mean NFS, Lustre and so on ?
> (I'm sorry about my ignorance...)

Something like NFS, yea.

> If I use only one node for application, do I need such a shared-file-system ?

No, for a single node, checkpointing with 1.4.1 should work (it works for me, at least). If you're using a single node, then your problem is probably not related to the bug report I posted.

Regards,
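Editorial note: one rough way to see whether a checkpoint directory sits on a shared file system such as NFS is to ask for its filesystem type. The sketch below is an illustration only. The function names are hypothetical, the list of type names is not exhaustive, and the "stat -f -c %T" form assumes GNU coreutils.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a filesystem type name denotes a network or
# parallel file system. The list is illustrative, not exhaustive.
is_shared_fs_type() {
    case $1 in
        nfs|nfs4|lustre|gpfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the filesystem type of a directory and whether it looks shared.
# Assumes GNU coreutils, where "stat -f -c %T" prints the type name.
# Returns 0 if shared, 1 if not, 2 if the type could not be determined.
check_ckpt_dir() {
    fstype=$(stat -f -c %T "$1") || return 2
    if is_shared_fs_type "$fstype"; then
        echo "$1: $fstype (looks shared)"
    else
        echo "$1: $fstype (probably node-local)"
        return 1
    fi
}
```

For example, check_ckpt_dir ~/tmp/ckpt could be run on each node of a multi-node job. For a single-node run, as noted above, a shared file system should not be needed.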
Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
Fernando,

Thank you for your reply.
I tried to patch the file you mentioned, but the output did not change.

> Are you using a shared file system? You need to use a shared file system for checkpointing with 1.4.1:

What do you mean by a shared file system? Do you mean NFS, Lustre, and so on?
(I'm sorry about my ignorance...)

If I use only one node for the application, do I still need such a shared file system?

On Mon, Apr 12, 2010 at 9:41 PM, Fernando Lemos wrote:
> On Mon, Apr 12, 2010 at 7:36 AM, Hideyuki Jitsumoto
> wrote:
>> Hi Members,
>>
>> I tried to use checkpoint/restart by openmpi.
>> But I can not get collect checkpoint data.
>> I prepared execution environment as follows, the strings in () mean
>> name of output file which attached on next e-mail ( for mail size
>> limitation ):
>>
>> 1. installed BLCR and checked BLCR is working correctly by "make check"
>> 2. executed ./configure with some parameters on openMPI source dir
>> (config.output / config.log)
>> 3. executed make and make install (make.output.2 / install.output.2)
>> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
>> /${INSTALL_DIR}/lib/openmpi
>> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
>> 6. compiled NPB and executed with -am ft-enable-cr
>> 7. invoked ompi-checkpoint
>>
>> As result, I got the message "Checkpoint failed: no processes checkpointed."
>> (cr_test_cg)
>
> Are you using a shared file system? You need to use a shared file
> system for checkpointing with 1.4.1:
>
> https://svn.open-mpi.org/trac/ompi/ticket/2139
>
> Regards,
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>

--
Sincerely Yours,
Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)
Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
On Mon, Apr 12, 2010 at 7:36 AM, Hideyuki Jitsumoto wrote:
> Hi Members,
>
> I tried to use checkpoint/restart by openmpi.
> But I can not get collect checkpoint data.
> I prepared execution environment as follows, the strings in () mean
> name of output file which attached on next e-mail ( for mail size
> limitation ):
>
> 1. installed BLCR and checked BLCR is working correctly by "make check"
> 2. executed ./configure with some parameters on openMPI source dir
> (config.output / config.log)
> 3. executed make and make install (make.output.2 / install.output.2)
> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
> /${INSTALL_DIR}/lib/openmpi
> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
> 6. compiled NPB and executed with -am ft-enable-cr
> 7. invoked ompi-checkpoint
>
> As result, I got the message "Checkpoint failed: no processes checkpointed."
> (cr_test_cg)

Are you using a shared file system? You need to use a shared file system for checkpointing with 1.4.1:

https://svn.open-mpi.org/trac/ompi/ticket/2139

Regards,
Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
I am attaching a file (2/2) to this email, as mentioned in my previous one.

Thank you,
Hideyuki

** WARNING: This email contains an attachment of a very suspicious type. **
** You are urged NOT to open this attachment unless you are absolutely **
** sure it is legitimate. Opening this attachment may cause irreparable **
** damage to your computer and your files. If you have any questions **
** about the validity of this message, PLEASE SEEK HELP BEFORE OPENING IT. **
** **
** This warning was added by the IU Computer Science Dept. mail scanner. **
Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
I am attaching a file (1/2) to this email, as mentioned in my previous one.
I'm very sorry to send such a large log file.

Thank you,
Hideyuki
[OMPI users] OpenMPI Checkpoint/Restart is failed
Hi Members,

I tried to use checkpoint/restart with Open MPI, but I cannot collect any checkpoint data.
I prepared the execution environment as follows. The strings in parentheses are the names of the output files, which are attached to my next e-mails (because of the mail size limit):

1. Installed BLCR and checked that it works correctly with "make check".
2. Ran ./configure with some parameters in the Open MPI source directory (config.output / config.log).
3. Ran make and make install (make.output.2 / install.output.2).
4. Confirmed that mca_crs_blcr.[la|so] and mca_crs_self.[la|so] exist in /${INSTALL_DIR}/lib/openmpi.
5. Created ~/.openmpi/mca-params.conf (mca-params.conf).
6. Compiled NPB and ran it with -am ft-enable-cr.
7. Invoked ompi-checkpoint.

As a result, I got the message "Checkpoint failed: no processes checkpointed." (cr_test_cg)

In addition, when I checked the ompi_info output as shown in your demo movie, I got
"MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)

What should I do to get checkpointing to work?
Any guidance in this regard would be highly appreciated.

Thank you,
Hideyuki

--
Sincerely Yours,
Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)
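Editorial note: step 4 above (confirming that the CRS components were installed) can be scripted. The following is a minimal sketch under stated assumptions: the function name check_crs_components is hypothetical, and only the .so files are checked, not the .la files.

```shell
#!/bin/sh
# Hypothetical helper: check that the BLCR and SELF checkpoint/restart (CRS)
# components named in step 4 exist under an Open MPI install prefix.
# Returns 0 if both are present, 1 if either is missing.
check_crs_components() {
    libdir=$1/lib/openmpi
    missing=0
    for comp in mca_crs_blcr mca_crs_self; do
        if [ -e "$libdir/$comp.so" ]; then
            echo "found:   $libdir/$comp.so"
        else
            echo "missing: $libdir/$comp.so"
            missing=1
        fi
    done
    return $missing
}
```

It would be called with the install prefix, for example check_crs_components "$INSTALL_DIR". A missing file suggests that configure did not enable checkpoint/restart support, but the presence of both files does not by itself show that checkpointing will succeed.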