Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-05-24 Thread Nguyen Toan
Hi all,

I had the same problem as Jitsumoto, i.e. OpenMPI 1.4.2 failed to restart
and the patch that Fernando gave didn't work.
I also tried the 1.5 nightly snapshots, but they did not seem to work well either.
For my own reasons, I don't want to use --enable-ft-thread in configure, but
the same error occurred even when --enable-ft-thread was used.
Here is my configure for OMPI 1.5a1r23135:

>./configure \
>--with-ft=cr \
>--enable-mpi-threads \
>--with-blcr=/home/nguyen/opt/blcr \
>--with-blcr-libdir=/home/nguyen/opt/blcr/lib \
>--prefix=/home/nguyen/opt/openmpi_1.5 --enable-mpirun-prefix-by-default

and errors:

>$ mpirun -am ft-enable-cr -machinefile ./host ./a.out
>0
>0
>1
>1
>2
>2
>3
>3
>--
>mpirun has exited due to process rank 1 with PID 6582 on
>node rc014 exiting improperly. There are two reasons this could occur:

>1. this process did not call "init" before exiting, but others in
>the job did. This can cause a job to hang indefinitely while it waits
>for all processes to call "init". By rule, if one process calls "init",
>then ALL processes must call "init" prior to termination.

>2. this process called "init", but exited without calling "finalize".
>By rule, all processes that call "init" MUST call "finalize" prior to
>exiting or it will be considered an "abnormal termination"

>This may have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>---

And here is the checkpoint command:

>$ ompi-checkpoint -s -v --term 10982
>[rc013.local:11001] [  0.00 /   0.14] Requested - ...
>[rc013.local:11001] [  0.00 /   0.14]   Pending - ...
>[rc013.local:11001] [  0.01 /   0.15]   Running - ...
>[rc013.local:11001] [  7.79 /   7.94]  Finished - ompi_global_snapshot_10982.ckpt
>Snapshot Ref.:   0 ompi_global_snapshot_10982.ckpt

I also took a look inside the checkpoint files and found that the snapshot
was taken:
~/tmp/ckpt/ompi_global_snapshot_10982.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6582
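
The same check can be repeated for every rank instead of a single one. What follows is only a rough sketch: the directory layout (global snapshot / sequence number / opal_snapshot_N.ckpt / ompi_blcr_context.PID) is inferred from the one path above, and check_snapshot is a made-up helper name, not an Open MPI tool.

```shell
#!/bin/sh
# For one checkpoint sequence, report whether each per-rank directory holds a
# non-empty BLCR context file. The layout is assumed from the path quoted above;
# check_snapshot is a hypothetical helper, not part of Open MPI.
check_snapshot() {
    global_dir="$1"      # e.g. ~/tmp/ckpt/ompi_global_snapshot_10982.ckpt
    seq="${2:-0}"        # checkpoint sequence number (the "0" in the path)
    seen=0
    all_ok=0
    for rank_dir in "$global_dir/$seq"/opal_snapshot_*.ckpt; do
        [ -d "$rank_dir" ] || continue
        seen=1
        found=no
        for ctx in "$rank_dir"/ompi_blcr_context.*; do
            # -s: the file exists and is not empty
            if [ -s "$ctx" ]; then found=yes; fi
        done
        printf '%s context=%s\n' "$(basename "$rank_dir")" "$found"
        [ "$found" = yes ] || all_ok=1
    done
    # Fail when no rank directory was found at all.
    [ "$seen" -eq 1 ] || return 1
    return "$all_ok"
}
```

Calling `check_snapshot ~/tmp/ckpt/ompi_global_snapshot_10982.ckpt 0` prints one line per rank directory and returns non-zero if any rank lacks a context file. It only shows that the files exist, not that they can be restarted.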

But restarting failed as follows:
>$ ompi-restart ompi_global_snapshot_10982.ckpt
>--
>mpirun noticed that process rank 1 with PID 11346 on node rc013.local
>exited on signal 11 (Segmentation fault).
>--

Does anyone have an idea about this? Thank you!

Regards,
Nguyen Toan


On Mon, May 24, 2010 at 4:08 PM, Hideyuki Jitsumoto <jitum...@gsic.titech.ac.jp> wrote:

> -- Forwarded message --
> From: Fernando Lemos <fernando...@gmail.com>
> Date: Thu, Apr 15, 2010 at 2:18 AM
> Subject: Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
> To: Open MPI Users <us...@open-mpi.org>
>
>
> On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto
> <hjitsum...@gmail.com> wrote:
> > Fernando,
> >
> > Thank you for your reply.
> > I tried to patch the file you mentioned, but the output did not change.
>
> I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
> works great.
>
> >> Are you using a shared file system? You need to use a shared file
> >> system for checkpointing with 1.4.1:
> > What is the shared file system ? do you mean NFS, Lustre and so on ?
> > (I'm sorry about my ignorance...)
>
> Something like NFS, yea.
>
> > If I use only one node for application, do I need such a
> > shared-file-system ?
>
> No, for a single node, checkpointing with 1.4.1 should work (it works
> for me, at least). If you're using a single node, then your problem is
> probably not related to the bug report I posted.
>
>
> Regards,
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
> Tokyo Institute of Technology
> Global Scientific Information and Computing center (Matsuoka Lab.)
>


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-05-19 Thread Hideyuki Jitsumoto
Hi Josh,

Thank you for your reply.
I applied the patch from Ticket #2139 to openmpi-1.4.1
and reinstalled everything from the very beginning.
After that, checkpointing worked correctly.
There was probably some fault in how I prepared my environment.

# I cannot reproduce the environment in which the failure occurred.
# I'm very sorry that I cannot identify the true cause of this malfunction
# and cannot send any further information.
# I now use openmpi-1.4.2, and it works well without any patch (except
# for ompi_info).

>> In addition, when I confirmed open_info output as your demo movie, I got
>> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)
>
> This is actually a known bug with ompi_info. I have a fix in the works for
> it, and should be available soon. Until then the ticket is linked below:
>  https://svn.open-mpi.org/trac/ompi/ticket/2097
Thank you, I'll try it.


On Wed, May 19, 2010 at 3:46 AM, Josh Hursey  wrote:
> (Sorry for the delay in replying, more below)
>
> On Apr 12, 2010, at 6:36 AM, Hideyuki Jitsumoto wrote:
>
>> Hi Members,
>>
>> I tried to use checkpoint/restart by openmpi.
>> But I can not get collect checkpoint data.
>> I prepared execution environment as follows, the strings in () mean
>> name of output file which attached on next e-mail ( for mail size
>> limitation ):
>>
>> 1. installed BLCR and checked BLCR is working correctly by "make check"
>> 2. executed ./configure with some parameters on openMPI source dir
>> (config.output / config.log)
>> 3. executed make and make install (make.output.2 / install.output.2)
>> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
>> /${INSTALL_DIR}/lib/openmpi
>> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
>> 6. compiled NPB and executed with -am ft-enable-cr
>> 7. invoked ompi-checkpoint 
>>
>> As result, I got the message "Checkpoint failed: no processes
>> checkpointed."
>> (cr_test_cg)
>
> It is unclear from the output what caused the checkpoint to fail. Can you
> turn on some verbose arguments and send me the output?
>
> Put the following options in your ~/.openmpi/mca-params.conf:
> #---
> orte_debug_daemons=1
> snapc_full_verbose=20
> crs_base_verbose=10
> opal_cr_verbose=10
> #---
>
>
>>
>> In addition, when I confirmed open_info output as your demo movie, I got
>> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)
>
> This is actually a known bug with ompi_info. I have a fix in the works for
> it, and should be available soon. Until then the ticket is linked below:
>  https://svn.open-mpi.org/trac/ompi/ticket/2097
>
>>
>> How should I do for checkpointing ?
>> Any guidance in this regard would be highly appreciated.
>
> Let's see what the verbose output tells us, and go from there. What version
> of BLCR are you using?
>
> -- Josh
>
>>
>> Thank you,
>> Hideyuki
>>



-- 
Sincerely Yours,
Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)



Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-05-18 Thread Josh Hursey

(Sorry for the delay in replying, more below)

On Apr 12, 2010, at 6:36 AM, Hideyuki Jitsumoto wrote:


> Hi Members,
>
> I tried to use checkpoint/restart by openmpi.
> But I can not get collect checkpoint data.
> I prepared execution environment as follows, the strings in () mean
> name of output file which attached on next e-mail ( for mail size
> limitation ):
>
> 1. installed BLCR and checked BLCR is working correctly by "make check"
> 2. executed ./configure with some parameters on openMPI source dir
> (config.output / config.log)
> 3. executed make and make install (make.output.2 / install.output.2)
> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
> /${INSTALL_DIR}/lib/openmpi
> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
> 6. compiled NPB and executed with -am ft-enable-cr
> 7. invoked ompi-checkpoint 
>
> As result, I got the message "Checkpoint failed: no processes checkpointed."
> (cr_test_cg)


It is unclear from the output what caused the checkpoint to fail. Can
you turn on some verbose arguments and send me the output?

Put the following options in your ~/.openmpi/mca-params.conf:
#---
orte_debug_daemons=1
snapc_full_verbose=20
crs_base_verbose=10
opal_cr_verbose=10
#---
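
One possible way to add those four settings from the shell without duplicating them on repeated runs is sketched below. It assumes a POSIX sh; add_cr_debug_params is a made-up helper name and not part of Open MPI, and the default path is simply the file named above.

```shell
#!/bin/sh
# Append the verbose checkpoint/restart settings listed above to an MCA
# parameter file, skipping any key that is already set there.
# add_cr_debug_params is a hypothetical helper, not an Open MPI command.
add_cr_debug_params() {
    conf="${1:-$HOME/.openmpi/mca-params.conf}"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    for setting in \
        orte_debug_daemons=1 \
        snapc_full_verbose=20 \
        crs_base_verbose=10 \
        opal_cr_verbose=10
    do
        key="${setting%%=*}"
        # Append only when no existing line already assigns this key.
        if ! grep -q "^${key}[[:space:]]*=" "$conf"; then
            printf '%s\n' "$setting" >> "$conf"
        fi
    done
}
```

Running `add_cr_debug_params` with no argument edits ~/.openmpi/mca-params.conf; passing a path edits that file instead, which is handy for trying it on a scratch copy first.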




> In addition, when I confirmed open_info output as your demo movie, I got
> "MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)

This is actually a known bug with ompi_info. I have a fix in the works
for it, which should be available soon. Until then, the ticket is linked
below:

  https://svn.open-mpi.org/trac/ompi/ticket/2097



> How should I do for checkpointing ?
> Any guidance in this regard would be highly appreciated.


Let's see what the verbose output tells us, and go from there. What  
version of BLCR are you using?


-- Josh



> Thank you,
> Hideyuki





Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-14 Thread Fernando Lemos
On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto wrote:
> Fernando,
>
> Thank you for your reply.
> I tried to patch the file you mentioned, but the output did not change.

I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
works great.

>> Are you using a shared file system? You need to use a shared file
>> system for checkpointing with 1.4.1:
> What is the shared file system ? do you mean NFS, Lustre and so on ?
> (I'm sorry about my ignorance...)

Something like NFS, yea.

> If I use only one node for application, do I need such a shared-file-system ?

No, for a single node, checkpointing with 1.4.1 should work (it works
for me, at least). If you're using a single node, then your problem is
probably not related to the bug report I posted.
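
For anyone unsure whether their snapshot directory is on a shared file system, a rough check along these lines may help. It assumes GNU stat (the `-f -c %T` form), and the list of network file system names is only an illustrative guess, not a complete one; all three function names are made up for this sketch.

```shell
#!/bin/sh
# Print the file system type of a directory (requires GNU stat).
fs_type_of() {
    stat -f -c %T "$1"
}

# Succeed if the type name is one of a few common network file systems.
# The list is an illustrative assumption and is not exhaustive.
is_shared_fs_type() {
    case "$1" in
        nfs*|lustre|gpfs|panfs|ceph) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage: flag a snapshot directory that looks node-local.
check_ckpt_dir() {
    fstype="$(fs_type_of "$1")" || return 2
    if is_shared_fs_type "$fstype"; then
        printf '%s: %s (shared)\n' "$1" "$fstype"
    else
        printf '%s: %s (probably node-local)\n' "$1" "$fstype"
        return 1
    fi
}
```

For example, `check_ckpt_dir "$HOME"` on each node shows whether the home directory is mounted from something like NFS. It does not prove that all nodes see the same export, so it is only a first hint.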


Regards,


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-14 Thread Hideyuki Jitsumoto
Fernando,

Thank you for your reply.
I tried to patch the file you mentioned, but the output did not change.

> Are you using a shared file system? You need to use a shared file
> system for checkpointing with 1.4.1:
What do you mean by a shared file system? Do you mean NFS, Lustre, and so on?
(I'm sorry for my ignorance...)

If I use only one node for the application, do I need such a shared file system?


On Mon, Apr 12, 2010 at 9:41 PM, Fernando Lemos  wrote:
> Are you using a shared file system? You need to use a shared file
> system for checkpointing with 1.4.1:
>
> https://svn.open-mpi.org/trac/ompi/ticket/2139
>
> Regards,



-- 
Sincerely Yours,
Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Fernando Lemos
On Mon, Apr 12, 2010 at 7:36 AM, Hideyuki Jitsumoto wrote:
> Hi Members,
>
> I tried to use checkpoint/restart by openmpi.
> But I can not get collect checkpoint data.
> I prepared execution environment as follows, the strings in () mean
> name of output file which attached on next e-mail ( for mail size
> limitation ):
>
> 1. installed BLCR and checked BLCR is working correctly by "make check"
> 2. executed ./configure with some parameters on openMPI source dir
> (config.output / config.log)
> 3. executed make and make install (make.output.2 / install.output.2)
> 4. confirmed that mca_crs_blcr.[la|so], mca_crs_self.[la|so] on
> /${INSTALL_DIR}/lib/openmpi
> 5. make ~/.openmpi/mca-params.conf (mca-params.conf)
> 6. compiled NPB and executed with -am ft-enable-cr
> 7. invoked ompi-checkpoint 
>
> As result, I got the message "Checkpoint failed: no processes checkpointed."
> (cr_test_cg)

Are you using a shared file system? You need to use a shared file
system for checkpointing with 1.4.1:

https://svn.open-mpi.org/trac/ompi/ticket/2139

Regards,


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Hideyuki Jitsumoto
I am attaching file 2/2 to this email, as mentioned in my previous one.

Thank you,
Hideyuki




Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Hideyuki Jitsumoto
I am attaching file 1/2 to this email, as mentioned in my previous one.
I'm very sorry for sending such a large log file.

Thank you,
Hideyuki




[OMPI users] OpenMPI Checkpoint/Restart is failed

2010-04-12 Thread Hideyuki Jitsumoto
Hi Members,

I tried to use checkpoint/restart with openmpi,
but I cannot get correct checkpoint data.
I prepared the execution environment as follows. The strings in () are the
names of the output files, which are attached to the next e-mails (because of
the mail size limit):

1. installed BLCR and checked that BLCR works correctly with "make check"
2. executed ./configure with some parameters in the openMPI source dir
(config.output / config.log)
3. executed make and make install (make.output.2 / install.output.2)
4. confirmed that mca_crs_blcr.[la|so] and mca_crs_self.[la|so] exist in
/${INSTALL_DIR}/lib/openmpi
5. created ~/.openmpi/mca-params.conf (mca-params.conf)
6. compiled NPB and executed it with -am ft-enable-cr
7. invoked ompi-checkpoint 
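
Step 4 can be repeated as a small script. This is only a sketch under the assumption that the components are installed as .so files in the directory given in step 4; have_crs_components is a made-up helper name.

```shell
#!/bin/sh
# Check that the crs components named in step 4 exist as shared objects in an
# Open MPI component directory. have_crs_components is a hypothetical helper.
have_crs_components() {
    libdir="$1"          # e.g. "${INSTALL_DIR}/lib/openmpi"
    missing=0
    for comp in mca_crs_blcr mca_crs_self; do
        if [ -f "$libdir/$comp.so" ]; then
            printf 'found   %s.so\n' "$comp"
        else
            printf 'missing %s.so\n' "$comp"
            missing=1
        fi
    done
    # Non-zero when at least one component file is absent.
    return "$missing"
}
```

It would be called as `have_crs_components "${INSTALL_DIR}/lib/openmpi"`. A found file only shows that the component was built and installed, not that it loads at run time.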

As a result, I got the message "Checkpoint failed: no processes checkpointed."
(cr_test_cg)

In addition, when I checked the ompi_info output as in your demo movie, I got
"MCA crs: none (MCA v2.0, API v2.0, Component v1.4.1)" (open_info.output)

What should I do to get checkpointing to work?
Any guidance in this regard would be highly appreciated.

Thank you,
Hideyuki

--
Sincerely Yours,
Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
Tokyo Institute of Technology
Global Scientific Information and Computing center (Matsuoka Lab.)