Re: [OMPI users] Checkpoint from inside MPI program with OpenMPI 1.4.2 ?

2011-10-27 Thread Nguyen Toan
Dear Josh,
This will really help a lot. Thank you for the support.

Best Regards,
Nguyen Toan

On Wed, Oct 26, 2011 at 9:20 PM, Josh Hursey <jjhur...@open-mpi.org> wrote:

> Since this would be a new feature for 1.4, we cannot move it since the
> 1.4 branch is for bug fixes only. However, we may be able to add it to
> 1.5. I filed a ticket if you want to track that progress:
>  https://svn.open-mpi.org/trac/ompi/ticket/2895
>
> -- Josh
>
>
> On Tue, Oct 25, 2011 at 11:52 PM, Nguyen Toan <nguyentoan1...@gmail.com>
> wrote:
> > Dear Josh,
> > Thank you. I will test the 1.7 trunk as you suggested.
> > Also I want to ask whether we can add this interface to OpenMPI 1.4.2,
> > because my applications mainly use this version.
> > Regards,
> > Nguyen Toan
> > On Wed, Oct 26, 2011 at 3:25 AM, Josh Hursey <jjhur...@open-mpi.org>
> wrote:
> >>
> >> Open MPI (trunk/1.7 - not 1.4 or 1.5) provides an application level
> >> interface to request a checkpoint of an application. This API is
> >> defined on the following website:
> >>  http://osl.iu.edu/research/ft/ompi-cr/api.php#api-cr_checkpoint
> >>
> >> This will behave the same as if you requested the checkpoint of the
> >> job from the command line.
> >>
> >> -- Josh
> >>
> >> On Mon, Oct 24, 2011 at 12:37 PM, Nguyen Toan <nguyentoan1...@gmail.com
> >
> >> wrote:
> >> > Dear all,
> >> > I want to automatically checkpoint an MPI program with OpenMPI (I'm
> >> > currently using version 1.4.2 with BLCR 0.8.2), rather than manually
> >> > typing the ompi-checkpoint command from another terminal.
> >> > So I would like to know whether there is a way to call a checkpoint
> >> > function from inside an MPI program with OpenMPI, and how to do that.
> >> > Any ideas would be much appreciated.
> >> > Regards,
> >> > Nguyen Toan
> >> > ___
> >> > users mailing list
> >> > us...@open-mpi.org
> >> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> >
> >>
> >>
> >>
> >> --
> >> Joshua Hursey
> >> Postdoctoral Research Associate
> >> Oak Ridge National Laboratory
> >> http://users.nccs.gov/~jjhursey
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Checkpoint from inside MPI program with OpenMPI 1.4.2 ?

2011-10-26 Thread Nguyen Toan
Dear Josh,

Thank you. I will test the 1.7 trunk as you suggested.
Also I want to ask whether we can add this interface to OpenMPI 1.4.2,
because my applications mainly use this version.

Regards,
Nguyen Toan

On Wed, Oct 26, 2011 at 3:25 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:

> Open MPI (trunk/1.7 - not 1.4 or 1.5) provides an application level
> interface to request a checkpoint of an application. This API is
> defined on the following website:
>  http://osl.iu.edu/research/ft/ompi-cr/api.php#api-cr_checkpoint
>
> This will behave the same as if you requested the checkpoint of the
> job from the command line.
>
> -- Josh
>
> On Mon, Oct 24, 2011 at 12:37 PM, Nguyen Toan <nguyentoan1...@gmail.com>
> wrote:
> > Dear all,
> > I want to automatically checkpoint an MPI program with OpenMPI (I'm
> > currently using version 1.4.2 with BLCR 0.8.2), rather than manually
> > typing the ompi-checkpoint command from another terminal.
> > So I would like to know whether there is a way to call a checkpoint
> > function from inside an MPI program with OpenMPI, and how to do that.
> > Any ideas would be much appreciated.
> > Regards,
> > Nguyen Toan
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
>
>
>
> --
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Checkpoint from inside MPI program with OpenMPI 1.4.2 ?

2011-10-24 Thread Nguyen Toan
Dear all,

I want to automatically checkpoint an MPI program with OpenMPI (I'm
currently using version 1.4.2 with BLCR 0.8.2), rather than manually typing
the ompi-checkpoint command from another terminal.
So I would like to know whether there is a way to call a checkpoint function
from inside an MPI program with OpenMPI, and how to do that.
Any ideas would be much appreciated.

Regards,
Nguyen Toan
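A minimal sketch of what the application-level checkpoint request described in Josh's replies above (trunk/1.7 C/R extension) might look like. This is an assumption-heavy illustration: the OMPI_HAVE_MPI_EXT_CR guard, the OMPI_CR_Checkpoint name and its argument list are taken on trust and should be checked against the api.php page linked in Josh's reply; the loop bounds are placeholders.

#include <stdio.h>
#include <mpi.h>
#if defined(OPEN_MPI)
#include <mpi-ext.h>   /* Open MPI extensions header (assumed location) */
#endif

int main(int argc, char **argv)
{
    int i, rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100; i++) {      /* placeholder iteration count */
        /* ... one iteration of application work ... */

#ifdef OMPI_HAVE_MPI_EXT_CR
        if (i == 50) {               /* checkpoint once, mid-run (placeholder) */
            char *handle = NULL;
            int   seq    = 0;
            /* Assumed call form; see the api.php page above for the exact
               prototype. This behaves like requesting the checkpoint of the
               job from the command line. */
            OMPI_CR_Checkpoint(&handle, &seq, MPI_INFO_NULL);
            if (rank == 0) {
                printf("checkpoint %d -> %s\n", seq, handle ? handle : "(null)");
            }
        }
#endif
    }

    MPI_Finalize();
    return 0;
}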


Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Nguyen Toan
Hi Roman,

It seems that you misunderstood the parameter "-machinefile".
This parameter should be followed by a file containing a list of machines
on which your MPI application will run. For example, if you want to
run your app on 2 nodes, named "node1" and "node2", then this file, let's call
it "MACHINES_FILE", should look like this:

node1
node2

Now try to checkpoint and restart again with "-machinefile MACHINES_FILE".
Hope it works.
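To make that concrete, a small sketch of the restart step with a machinefile, reusing the snapshot name from Roman's output quoted below; "node1" and "node2" are placeholders for the actual cluster hosts:

# Machinefile listing the nodes the job should run on (placeholder node names)
cat > MACHINES_FILE <<EOF
node1
node2
EOF

# Restart the global snapshot, passing the machinefile to ompi-restart
ompi-restart -machinefile MACHINES_FILE ompi_global_snapshot_28952.ckpt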

On Wed, Apr 6, 2011 at 9:13 PM, Hellmüller Roman <hro...@student.ethz.ch>wrote:

> Hi Toan
>
> Thanks for your suggestion. It gives me the following result, which does not
> tell me anything more.
>
> hroman@cbl1 ~/checkpoints $ ompi-restart -v -machinefile
> ../semesterthesis/code/code2_self_example/my-hroman-cr-file.ckpt
> ompi_global_snapshot_28952.ckpt/
> [cbl1:28974] Checking for the existence of
> (/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
> [cbl1:28974] Restarting from file (ompi_global_snapshot_28952.ckpt/)
> [cbl1:28974]  Exec in self
> ssh: connect to host 15 port 22: Invalid argument
> --
> A daemon (pid 28975) died unexpectedly with status 255 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
> hroman@cbl1 ~/checkpoints $ echo $LD_LIBRARY_PATH
>
> /cluster/sw/blcr/0.8.2/x86_64/gcc//lib:/cluster/sw/openmpi/1.5.3_ft/x86_64/gcc/lib:/opt/intel/Compiler/11.1/056/lib/intel64
>
> The library path seems to be OK, or should it look different? Do you have
> another idea?
> cheers
> roman
>
> 
> From: users-boun...@open-mpi.org [users-boun...@open-mpi.org] on behalf of
> Nguyen Toan [nguyentoan1...@gmail.com]
> Sent: Wednesday, April 6, 2011 13:20
> To: Open MPI Users
> Subject: Re: [OMPI users] openmpi self checkpointing - error while running
> example
>
> Hi Roman,
>
> Did you try to checkpoint and restart with the parameter "-machinefile"? It
> may work.
>
> Regards,
> Nguyen Toan
>
> On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman <hro...@student.ethz.ch
> <mailto:hro...@student.ethz.ch>> wrote:
> Hi
>
> I'm trying to get fault-tolerant Open MPI running on our cluster for my
> semester thesis.
>
> The build & compile were successful, and BLCR checkpointing works (Open MPI
> 1.5.3, BLCR 0.8.2).
>
> Now I'm trying to set up the SELF checkpointing. The example from
> http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can
> run the application and also take checkpoints, but restarting won't work. I
> got the following error by doing as suggested:
>
> mpicc my-app.c -export -export-dynamic -o my-app
>
> mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app
>
> hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
> --
> Error: Unable to obtain the proper restart command to restart from the
>  checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
> --
> --
> Error: Unable to obtain the proper restart command to restart from the
>  checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
>
> I also tried setting the path in the example file (the restart_path
> variable), changing the checkpoint directories, and running the application
> in different directories...
>
> Do you have an idea where the error could be?
>
> here http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz (40MB)

Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Nguyen Toan
Hi Roman,

Did you try to checkpoint and restart with the parameter "-machinefile"? It
may work.

Regards,
Nguyen Toan

On Wed, Apr 6, 2011 at 7:05 PM, Hellmüller Roman <hro...@student.ethz.ch>wrote:

> Hi
>
> I'm trying to get fault-tolerant Open MPI running on our cluster for my
> semester thesis.
>
> The build & compile were successful, and BLCR checkpointing works (Open MPI
> 1.5.3, BLCR 0.8.2).
>
> Now I'm trying to set up the SELF checkpointing. The example from
> http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can
> run the application and also take checkpoints, but restarting won't work. I
> got the following error by doing as suggested:
>
> mpicc my-app.c -export -export-dynamic -o my-app
>
> mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app
>
> hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
> --
> Error: Unable to obtain the proper restart command to restart from the
>   checkpoint file (opal_snapshot_0.ckpt). Returned -1.
>
> --
> --
> Error: Unable to obtain the proper restart command to restart from the
>   checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>
> --
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --
>
> I also tried setting the path in the example file (the restart_path
> variable), changing the checkpoint directories, and running the application
> in different directories...
>
> Do you have an idea where the error could be?
>
> Here http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz (40MB)
> you'll find the library and the build of openmpi & blcr as well as the env
> variables and the output of ompi_info. There is one for the login node and
> the other for the compute nodes due to different kernels. And here
> http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
> there is the produced checkpoint. Please let me know if more outputs are
> needed.
>
> cheers
> roman
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-03-03 Thread Nguyen Toan
Thanks Josh.
Actually I also tested with the Himeno benchmark
(http://accc.riken.jp/assets/files/himenob_loadmodule/himenoBMT_c_mpi.lzh) and
got the same problem, so I think this could be a bug.
Hope this information also helps.

Regards,
Nguyen Toan

On Fri, Mar 4, 2011 at 12:04 AM, Joshua Hursey <jjhur...@open-mpi.org>wrote:

> Thanks for the program. I created a ticket for this performance bug and
> attached the tarball to the ticket:
>  https://svn.open-mpi.org/trac/ompi/ticket/2743
>
> I do not know exactly when I will be able to get back to this, but
> hopefully soon. I added you to the CC so you should receive any progress
> updates regarding the ticket as we move forward.
>
> Thanks again,
> Josh
>
> On Mar 3, 2011, at 2:12 AM, Nguyen Toan wrote:
>
> > Dear Josh,
> >
> > Attached with this email is a small program that illustrates the
> performance problem. You can find simple instructions in the README file.
> > There are also 2 sample result files (cpu.256^3.8N.*) which show the
> execution time difference between 2 cases.
> > Hope you can take some time to find the problem.
> > Thanks for your kindness.
> >
> > Best Regards,
> > Nguyen Toan
> >
> > On Wed, Mar 2, 2011 at 3:00 AM, Joshua Hursey <jjhur...@open-mpi.org>
> wrote:
> > I have not had the time to look into the performance problem yet, and
> probably won't for a little while. Can you send me a small program that
> illustrates the performance problem, and I'll file a bug so we don't lose
> track of it.
> >
> > Thanks,
> > Josh
> >
> > On Feb 25, 2011, at 1:31 PM, Nguyen Toan wrote:
> >
> > > Dear Josh,
> > >
> > > Did you find out the problem? I still cannot make any progress.
> > > Hope to hear some good news from you.
> > >
> > > Regards,
> > > Nguyen Toan
> > >
> > > On Sun, Feb 13, 2011 at 3:04 PM, Nguyen Toan <nguyentoan1...@gmail.com>
> wrote:
> > > Hi Josh,
> > >
> > > I tried the MCA parameter you mentioned but it did not help, the
> unknown overhead still exists.
> > > Here I attach the output of 'ompi_info', both version 1.5 and 1.5.1.
> > > Hope you can find out the problem.
> > > Thank you.
> > >
> > > Regards,
> > > Nguyen Toan
> > >
> > > On Wed, Feb 9, 2011 at 11:08 PM, Joshua Hursey <jjhur...@open-mpi.org>
> wrote:
> > > It looks like the logic in the configure script is turning on the FT
> thread for you when you specify both '--with-ft=cr' and
> '--enable-mpi-threads'.
> > >
> > > Can you send me the output of 'ompi_info'? Can you also try the MCA
> parameter that I mentioned earlier to see if that changes the performance?
> > >
> > > If there are many non-blocking sends and receives, there might be a
> performance bug with the way the point-to-point wrapper is tracking request
> objects. If the above MCA parameter does not help the situation, let me know
> and I might be able to take a look at this next week.
> > >
> > > Thanks,
> > > Josh
> > >
> > > On Feb 9, 2011, at 1:40 AM, Nguyen Toan wrote:
> > >
> > > > Hi Josh,
> > > > Thanks for the reply. I did not use the '--enable-ft-thread' option.
> Here are my build options:
> > > >
> > > > CFLAGS=-g \
> > > > ./configure \
> > > > --with-ft=cr \
> > > > --enable-mpi-threads \
> > > > --with-blcr=/home/nguyen/opt/blcr \
> > > > --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> > > > --prefix=/home/nguyen/opt/openmpi \
> > > > --with-openib \
> > > > --enable-mpirun-prefix-by-default
> > > >
> > > > My application requires lots of communication in every loop, focusing
> on MPI_Isend, MPI_Irecv and MPI_Wait. Also I want to make only one
> checkpoint per application execution for my purpose, but the unknown
> overhead exists even when no checkpoint was taken.
> > > >
> > > > Do you have any other idea?
> > > >
> > > > Regards,
> > > > Nguyen Toan
> > > >
> > > >
> > > > On Wed, Feb 9, 2011 at 12:41 AM, Joshua Hursey <
> jjhur...@open-mpi.org> wrote:
> > > > There are a few reasons why this might be occurring. Did you build
> with the '--enable-ft-thread' option?
> > > >
> > > > If so, it looks like I didn't move over the thread_sleep_wait
> adjustment from the trunk - the thread was being a bit too aggressive. Try
> adding the following to your command line options, and see if it changes the
> performance.
>  "-mca opal_cr_thread_sleep_wait 1000"

Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-02-25 Thread Nguyen Toan
Dear Josh,

Did you find out the problem? I still cannot make any progress.
Hope to hear some good news from you.

Regards,
Nguyen Toan

On Sun, Feb 13, 2011 at 3:04 PM, Nguyen Toan <nguyentoan1...@gmail.com>wrote:

> Hi Josh,
>
> I tried the MCA parameter you mentioned but it did not help, the unknown
> overhead still exists.
> Here I attach the output of 'ompi_info', both version 1.5 and 1.5.1.
> Hope you can find out the problem.
> Thank you.
>
> Regards,
> Nguyen Toan
>
> On Wed, Feb 9, 2011 at 11:08 PM, Joshua Hursey <jjhur...@open-mpi.org>wrote:
>
>> It looks like the logic in the configure script is turning on the FT
>> thread for you when you specify both '--with-ft=cr' and
>> '--enable-mpi-threads'.
>>
>> Can you send me the output of 'ompi_info'? Can you also try the MCA
>> parameter that I mentioned earlier to see if that changes the performance?
>>
>> If there are many non-blocking sends and receives, there might be a
>> performance bug with the way the point-to-point wrapper is tracking request
>> objects. If the above MCA parameter does not help the situation, let me know
>> and I might be able to take a look at this next week.
>>
>> Thanks,
>> Josh
>>
>> On Feb 9, 2011, at 1:40 AM, Nguyen Toan wrote:
>>
>> > Hi Josh,
>> > Thanks for the reply. I did not use the '--enable-ft-thread' option.
>> Here are my build options:
>> >
>> > CFLAGS=-g \
>> > ./configure \
>> > --with-ft=cr \
>> > --enable-mpi-threads \
>> > --with-blcr=/home/nguyen/opt/blcr \
>> > --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
>> > --prefix=/home/nguyen/opt/openmpi \
>> > --with-openib \
>> > --enable-mpirun-prefix-by-default
>> >
>> > My application requires lots of communication in every loop, focusing on
>> MPI_Isend, MPI_Irecv and MPI_Wait. Also I want to make only one checkpoint
>> per application execution for my purpose, but the unknown overhead exists
>> even when no checkpoint was taken.
>> >
>> > Do you have any other idea?
>> >
>> > Regards,
>> > Nguyen Toan
>> >
>> >
>> > On Wed, Feb 9, 2011 at 12:41 AM, Joshua Hursey <jjhur...@open-mpi.org>
>> wrote:
>> > There are a few reasons why this might be occurring. Did you build with
>> the '--enable-ft-thread' option?
>> >
>> > If so, it looks like I didn't move over the thread_sleep_wait adjustment
>> from the trunk - the thread was being a bit too aggressive. Try adding the
>> following to your command line options, and see if it changes the
>> performance.
>> >  "-mca opal_cr_thread_sleep_wait 1000"
>> >
>> > There are other places to look as well depending on how frequently your
>> application communicates, how often you checkpoint, process layout, ... But
>> usually the aggressive nature of the thread is the main problem.
>> >
>> > Let me know if that helps.
>> >
>> > -- Josh
>> >
>> > On Feb 8, 2011, at 2:50 AM, Nguyen Toan wrote:
>> >
>> > > Hi all,
>> > >
>> > > I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2).
>> > > I found that when running an application, which uses MPI_Isend,
>> MPI_Irecv and MPI_Wait,
>> > > enabling C/R, i.e. using "-am ft-enable-cr", the application runtime is
>> much longer than the normal execution with mpirun (no checkpoint was taken).
>> > > This overhead becomes larger when the normal execution runtime is
>> longer.
>> > > Does anybody have any idea about this overhead, and how to eliminate
>> it?
>> > > Thanks.
>> > >
>> > > Regards,
>> > > Nguyen
>> > > ___
>> > > users mailing list
>> > > us...@open-mpi.org
>> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> > 
>> > Joshua Hursey
>> > Postdoctoral Research Associate
>> > Oak Ridge National Laboratory
>> > http://users.nccs.gov/~jjhursey
>> >
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>> >
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>> 
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>


Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-02-13 Thread Nguyen Toan
Hi Josh,

I tried the MCA parameter you mentioned, but it did not help; the unknown
overhead still exists.
Here I attach the output of 'ompi_info', both version 1.5 and 1.5.1.
Hope you can find out the problem.
Thank you.

Regards,
Nguyen Toan

On Wed, Feb 9, 2011 at 11:08 PM, Joshua Hursey <jjhur...@open-mpi.org>wrote:

> It looks like the logic in the configure script is turning on the FT thread
> for you when you specify both '--with-ft=cr' and '--enable-mpi-threads'.
>
> Can you send me the output of 'ompi_info'? Can you also try the MCA
> parameter that I mentioned earlier to see if that changes the performance?
>
> If there are many non-blocking sends and receives, there might be a
> performance bug with the way the point-to-point wrapper is tracking request
> objects. If the above MCA parameter does not help the situation, let me know
> and I might be able to take a look at this next week.
>
> Thanks,
> Josh
>
> On Feb 9, 2011, at 1:40 AM, Nguyen Toan wrote:
>
> > Hi Josh,
> > Thanks for the reply. I did not use the '--enable-ft-thread' option. Here
> are my build options:
> >
> > CFLAGS=-g \
> > ./configure \
> > --with-ft=cr \
> > --enable-mpi-threads \
> > --with-blcr=/home/nguyen/opt/blcr \
> > --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> > --prefix=/home/nguyen/opt/openmpi \
> > --with-openib \
> > --enable-mpirun-prefix-by-default
> >
> > My application requires lots of communication in every loop, focusing on
> MPI_Isend, MPI_Irecv and MPI_Wait. Also I want to make only one checkpoint
> per application execution for my purpose, but the unknown overhead exists
> even when no checkpoint was taken.
> >
> > Do you have any other idea?
> >
> > Regards,
> > Nguyen Toan
> >
> >
> > On Wed, Feb 9, 2011 at 12:41 AM, Joshua Hursey <jjhur...@open-mpi.org>
> wrote:
> > There are a few reasons why this might be occurring. Did you build with
> the '--enable-ft-thread' option?
> >
> > If so, it looks like I didn't move over the thread_sleep_wait adjustment
> from the trunk - the thread was being a bit too aggressive. Try adding the
> following to your command line options, and see if it changes the
> performance.
> >  "-mca opal_cr_thread_sleep_wait 1000"
> >
> > There are other places to look as well depending on how frequently your
> application communicates, how often you checkpoint, process layout, ... But
> usually the aggressive nature of the thread is the main problem.
> >
> > Let me know if that helps.
> >
> > -- Josh
> >
> > On Feb 8, 2011, at 2:50 AM, Nguyen Toan wrote:
> >
> > > Hi all,
> > >
> > > I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2).
> > > I found that when running an application, which uses MPI_Isend,
> MPI_Irecv and MPI_Wait,
> > > enabling C/R, i.e. using "-am ft-enable-cr", the application runtime is
> much longer than the normal execution with mpirun (no checkpoint was taken).
> > > This overhead becomes larger when the normal execution runtime is
> longer.
> > > Does anybody have any idea about this overhead, and how to eliminate
> it?
> > > Thanks.
> > >
> > > Regards,
> > > Nguyen
> > > ___
> > > users mailing list
> > > us...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > 
> > Joshua Hursey
> > Postdoctoral Research Associate
> > Oak Ridge National Laboratory
> > http://users.nccs.gov/~jjhursey
> >
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> >
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


ompi_info.1.5
Description: Binary data


ompi_info.1.5.1
Description: Binary data


Re: [OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-02-09 Thread Nguyen Toan
Hi Josh,
Thanks for the reply. I did not use the '--enable-ft-thread' option. Here are
my build options:

CFLAGS=-g \
./configure \
--with-ft=cr \
--enable-mpi-threads \
--with-blcr=/home/nguyen/opt/blcr \
--with-blcr-libdir=/home/nguyen/opt/blcr/lib \
--prefix=/home/nguyen/opt/openmpi \
--with-openib \
--enable-mpirun-prefix-by-default

My application requires lots of communication in every loop, focusing on
MPI_Isend, MPI_Irecv and MPI_Wait. Also I want to make only one checkpoint
per application execution for my purpose, but the unknown overhead exists
even when no checkpoint was taken.

Do you have any other idea?

Regards,
Nguyen Toan


On Wed, Feb 9, 2011 at 12:41 AM, Joshua Hursey <jjhur...@open-mpi.org>wrote:

> There are a few reasons why this might be occurring. Did you build with the
> '--enable-ft-thread' option?
>
> If so, it looks like I didn't move over the thread_sleep_wait adjustment
> from the trunk - the thread was being a bit too aggressive. Try adding the
> following to your command line options, and see if it changes the
> performance.
>  "-mca opal_cr_thread_sleep_wait 1000"
>
> There are other places to look as well depending on how frequently your
> application communicates, how often you checkpoint, process layout, ... But
> usually the aggressive nature of the thread is the main problem.
>
> Let me know if that helps.
>
> -- Josh
>
> On Feb 8, 2011, at 2:50 AM, Nguyen Toan wrote:
>
> > Hi all,
> >
> > I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2).
> > I found that when running an application, which uses MPI_Isend, MPI_Irecv
> and MPI_Wait,
> > enabling C/R, i.e. using "-am ft-enable-cr", the application runtime is
> much longer than the normal execution with mpirun (no checkpoint was taken).
> > This overhead becomes larger when the normal execution runtime is longer.
> > Does anybody have any idea about this overhead, and how to eliminate it?
> > Thanks.
> >
> > Regards,
> > Nguyen
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Unknown overhead in "mpirun -am ft-enable-cr"

2011-02-08 Thread Nguyen Toan
Hi all,

I am using the latest version of OpenMPI (1.5.1) and BLCR (0.8.2).
I found that when running an application, which uses MPI_Isend, MPI_Irecv and
MPI_Wait,
enabling C/R, i.e. using "-am ft-enable-cr", the application runtime is much
longer than the normal execution with mpirun (no checkpoint was taken).
This overhead becomes larger when the normal execution runtime is longer.
Does anybody have any idea about this overhead, and how to eliminate it?
Thanks.

Regards,
Nguyen
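Putting Josh's suggestion from the replies above together with the C/R run options used in this thread, a sketch of the command line (the process count and executable name are placeholders):

# C/R-enabled run with a less aggressive C/R thread polling interval
mpirun -np 2 -am ft-enable-cr \
       -mca opal_cr_thread_sleep_wait 1000 \
       ./my-app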


Re: [OMPI users] mpirun error in OpenMPI 1.5

2010-12-08 Thread Nguyen Toan
Dear Ralph,

Thank you for your reply. I did check the LD_LIBRARY_PATH and recompiled with
the new version, and it worked perfectly.
Thank you again.

Best Regards,
Toan

On Thu, Dec 9, 2010 at 12:30 AM, Ralph Castain <r...@open-mpi.org> wrote:

> That could mean you didn't recompile the code using the new version of
> OMPI. The 1.4 and 1.5 series are not binary compatible - you have to
> recompile your code.
>
> If you did recompile, you may be getting version confusion on the backend
> nodes - you should check your ld_library_path and ensure it is pointing to
> the 1.5 series install.
>
> On Dec 8, 2010, at 8:02 AM, Nguyen Toan wrote:
>
> > Dear all,
> >
> > I am having a problem while running mpirun in OpenMPI 1.5 version. I
> compiled OpenMPI 1.5 with BLCR 0.8.2 and OFED 1.4.1 as follows:
> >
> > ./configure \
> > --with-ft=cr \
> > --enable-mpi-threads \
> > --with-blcr=/home/nguyen/opt/blcr \
> > --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> > --prefix=/home/nguyen/opt/openmpi-1.5 \
> > --with-openib \
> > --enable-mpirun-prefix-by-default
> >
> > For programs under "openmpi-1.5/examples" folder, mpirun tests were
> successful. But mpirun aborted immediately when running a program in MPI
> CUDA code, which was tested successfully with OpenMPI 1.4.3. Below is the
> error message.
> >
> > Can anyone give me an idea about this error?
> > Thank you.
> >
> > Best Regards,
> > Toan
> > --
> >
> >
> > [rc002.local:17727] [[56831,1],1] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file util/nidmap.c at line 371
> >
> --
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems.  This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> >   orte_ess_base_build_nidmap failed
> >   --> Returned value Data unpack would read past end of buffer (-26)
> instead of ORTE_SUCCESS
> >
> --
> > [rc002.local:17727] [[56831,1],1] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file base/ess_base_nidmap.c at line 62
> > [rc002.local:17727] [[56831,1],1] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file ess_env_module.c at line 173
> >
> --
> > It looks like orte_init failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during orte_init; some of which are due to configuration or
> > environment problems.  This failure appears to be an internal failure;
> > here's some additional information (which may only be relevant to an
> > Open MPI developer):
> >
> >   orte_ess_set_name failed
> >   --> Returned value Data unpack would read past end of buffer (-26)
> instead of ORTE_SUCCESS
> >
> --
> > [rc002.local:17727] [[56831,1],1] ORTE_ERROR_LOG: Data unpack would read
> past end of buffer in file runtime/orte_init.c at line 132
> >
> --
> > It looks like MPI_INIT failed for some reason; your parallel process is
> > likely to abort.  There are many reasons that a parallel process can
> > fail during MPI_INIT; some of which are due to configuration or
> environment
> > problems.  This failure appears to be an internal failure; here's some
> > additional information (which may only be relevant to an Open MPI
> > developer):
> >
> >   ompi_mpi_init: orte_init failed
> >   --> Returned "Data unpack would read past end of buffer" (-26) instead
> of "Success" (0)
> >
> --
> > *** An error occurred in MPI_Init
> > *** before MPI was initialized
> > *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> > [rc002.local:17727] Abort before MPI_INIT completed successfully; not
> able to guarantee that all other processes were killed!
> >
> --
> > mpirun has exited due to process rank 1 with PID 17

[OMPI users] mpirun error in OpenMPI 1.5

2010-12-08 Thread Nguyen Toan
Dear all,

I am having a problem when running mpirun with OpenMPI version 1.5. I
compiled OpenMPI 1.5 with BLCR 0.8.2 and OFED 1.4.1 as follows:

./configure \
--with-ft=cr \
--enable-mpi-threads \
--with-blcr=/home/nguyen/opt/blcr \
--with-blcr-libdir=/home/nguyen/opt/blcr/lib \
--prefix=/home/nguyen/opt/openmpi-1.5 \
--with-openib \
--enable-mpirun-prefix-by-default

For the programs under the "openmpi-1.5/examples" folder, the mpirun tests were
successful. But mpirun aborted immediately when running an MPI CUDA program,
which had been tested successfully with OpenMPI 1.4.3. Below is the
error message.

Can anyone give me an idea about this error?
Thank you.

Best Regards,
Toan
--


[rc002.local:17727] [[56831,1],1] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file util/nidmap.c at line 371
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_build_nidmap failed
  --> Returned value Data unpack would read past end of buffer (-26) instead
of ORTE_SUCCESS
--
[rc002.local:17727] [[56831,1],1] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file base/ess_base_nidmap.c at line 62
[rc002.local:17727] [[56831,1],1] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file ess_env_module.c at line 173
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Data unpack would read past end of buffer (-26) instead
of ORTE_SUCCESS
--
[rc002.local:17727] [[56831,1],1] ORTE_ERROR_LOG: Data unpack would read
past end of buffer in file runtime/orte_init.c at line 132
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Data unpack would read past end of buffer" (-26) instead of
"Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[rc002.local:17727] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
--
mpirun has exited due to process rank 1 with PID 17727 on
node rc002 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--
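Ralph's advice in the reply above amounts to recompiling against the 1.5 install and making sure the backend nodes pick up the matching libraries; a sketch, assuming a bash-like shell and the install prefix from the configure line above (the application file name is a placeholder):

# Point the environment at the OpenMPI 1.5 install
export PATH=/home/nguyen/opt/openmpi-1.5/bin:$PATH
export LD_LIBRARY_PATH=/home/nguyen/opt/openmpi-1.5/lib:$LD_LIBRARY_PATH

# Confirm which installation is being picked up
which mpirun
ompi_info | head

# Recompile the application against the new install before rerunning
mpicc -o my_cuda_app my_cuda_app.c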


Re: [OMPI users] How to checkpoint atomic function in OpenMPI

2010-07-22 Thread Nguyen Toan
Dear Josh,
I hope to see this new API soon. Anyway, I will try these critical section
functions in BLCR. Thank you for the support.

Best Regards,
Nguyen Toan

On Sat, Jul 17, 2010 at 6:34 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:

>
> On Jun 14, 2010, at 5:26 AM, Nguyen Toan wrote:
>
> > Hi all,
> > I have a MPI program as follows:
> > ---
> > int main(){
> >MPI_Init();
> >..
> >for (i=0; i<1; i++) {
> >   my_atomic_func();
> >}
> >...
> >MPI_Finalize();
> >return 0;
> > }
> > 
> >
> > The runtime of this program is mostly spent in the loop, and
> my_atomic_func() takes a fairly long time.
> > Here I want my_atomic_func() to execute atomically, but the checkpoint
> (taken by running the ompi-checkpoint command) may happen in the middle of a
> my_atomic_func() call, and hence ompi-restart may fail to restart
> correctly.
> >
> > So my question is:
> > + At the checkpoint time (executing ompi-checkpoint), is there a way to
> let OpenMPI wait until my_atomic_func()  finishes its operation?
>
> We do not currently have an external function to declare a critical section
> during which a checkpoint should not be taken. I filed a ticket to make one
> available. The link is below if you would like to follow its progress:
>  https://svn.open-mpi.org/trac/ompi/ticket/2487
>
> I have an MPI Extension interface for C/R that I will be bringing into the
> trunk in the next few weeks. I should be able to extend it to include this
> feature. But I can't promise a deadline, just that I will update the ticket
> when it is available.
>
> In the mean time you might try to use the BLCR interface to define critical
> sections. If you are using the C/R thread then this may work (though I have
> not tried it):
>  cr_enter_cs()
>  cr_leave_cs()
>
> > + How does ompi-checkpoint operate to checkpoint MPI threads?
>
> We depend on the Checkpoint/Restart Service (e.g., BLCR) to capture the
> whole process image including all threads. So BLCR will capture the state of
> all threads when we take the process checkpoint.
>
> -- Josh
>
> >
> > Regards,
> > Nguyen Toan
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
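A minimal sketch of the BLCR critical-section approach Josh suggests above, wrapping the user's my_atomic_func() between cr_enter_cs() and cr_leave_cs(). The libcr.h header, the cr_init()/cr_client_id_t usage, and the -lcr link flag are assumptions to be checked against the BLCR documentation; the iteration count is a placeholder.

#include <mpi.h>
#include <libcr.h>                 /* BLCR client library (assumed; link with -lcr) */

#define NITERS 100                 /* placeholder iteration count */

void my_atomic_func(void);         /* the user's long-running routine */

int main(int argc, char **argv)
{
    int i;
    /* Register with BLCR so this process can mark critical sections.
       If Open MPI's C/R support already initialized BLCR in this process,
       this may be redundant; check the BLCR docs. */
    cr_client_id_t cr_id = cr_init();

    MPI_Init(&argc, &argv);

    for (i = 0; i < NITERS; i++) {
        /* No BLCR checkpoint can be taken between enter and leave, so
           my_atomic_func() is never split across a checkpoint. */
        cr_enter_cs(cr_id);
        my_atomic_func();
        cr_leave_cs(cr_id);
    }

    MPI_Finalize();
    return 0;
}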


Re: [OMPI users] Question on checkpoint overhead in Open MPI

2010-07-22 Thread Nguyen Toan
Dear Josh,
Thank you very much for the reply. I am sorry if my question was unclear, so
let me reorganize it.
Currently I am applying the staging technique with the mca-params.conf
setting as follows:
snapc_base_store_in_place=0  # enable remote file transfer to global storage
crs_base_snapshot_dir=/ssd/tmp/ckpt/local
snapc_base_global_snapshot_dir=/ssd/tmp/ckpt/global

and I am concerned with the amount "Others" = checkpoint latency -
checkpoint overhead.
According to your answer, remote file transfer is done asynchronously while
the application continues execution.
From my observation, the "Others" overhead increases greatly when the data
size and the number of processes increase. So is the scp time for transferring
files to stable storage mainly included in "Others"?
As you said, the amount of checkpoint overhead is application and system
configuration specific, but in general is there any relationship between
"Others" and the number of processes or data size?
Thank you.

Best Regards,
Nguyen Toan

On Sat, Jul 17, 2010 at 6:25 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:

> The amount of checkpoint overhead is application and system configuration
> specific. So it is impossible to give you a good answer to how much
> checkpoint overhead to expect for your application and system setup.
>
> BLCR is only used to capture the single process image. The coordination of
> the distributed checkpoint includes:
>  - the time to initiate the checkpoint,
>  - the time to marshall the network (we currently use an all-to-all
> bookmark exchange, similar to to what LAM/MPI used),
>  - Store the local checkpoints to stable storage,
>  - Verify that all of the local checkpoints have been stored successfully,
> and
>  - Return the handle to the end user.
>
> The bulk of the time is spent saving the local checkpoints (a.k.a.
> snapshots) to stable storage. By default Open MPI saves directly to a
> globally mounted storage device. So the application is stalled until the
> checkpoint is complete (checkpoint overhead = checkpoint latency). You can
> also enable checkpoint staging in which the application saves the checkpoint
> to a local disk. After which the local daemon stages the file back to stable
> storage while the application continues execution (checkpoint overhead <<
> checkpoint latency).
>
> If you are concerned with scaling, definitely look at the staging
> technique.
>
> Does that help?
>
> -- Josh
>
> On Jul 7, 2010, at 12:25 PM, Nguyen Toan wrote:
>
> > Hello everyone,
> > I have a question concerning the checkpoint overhead in Open MPI, which
> is the difference taken from the runtime of application execution with and
> without checkpoint.
> > I observe that when the data size and the number of processes increases,
> the runtime of BLCR is very small compared to the overall checkpoint
> overhead in Open MPI. Is it because of the increase of coordination time for
> checkpoint? And what is included in the overall checkpoint overhead besides
> the BLCR's checkpoint overhead and coordination time?
> > Thank you.
> >
> > Best Regards,
> > Nguyen Toan
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
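The staging setup Nguyen Toan describes in the message above is driven by MCA parameters; a sketch of a per-user mca-params.conf carrying those settings (the $HOME/.openmpi/mca-params.conf location is the usual per-user parameter file, and the paths are the site-specific ones quoted from the message):

# $HOME/.openmpi/mca-params.conf
# Stage checkpoints via node-local storage, then transfer to global storage
snapc_base_store_in_place=0
crs_base_snapshot_dir=/ssd/tmp/ckpt/local
snapc_base_global_snapshot_dir=/ssd/tmp/ckpt/global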


Re: [OMPI users] Question on checkpoint overhead in Open MPI

2010-07-15 Thread Nguyen Toan
Can somebody help, please? I am sorry to spam the mailing list, but I really
need your help.
Thanks in advance.

Best Regards,
Nguyen Toan

On Thu, Jul 8, 2010 at 1:25 AM, Nguyen Toan <nguyentoan1...@gmail.com>wrote:

> Hello everyone,
> I have a question concerning the checkpoint overhead in Open MPI, which is
> the difference taken from the runtime of application execution with and
> without checkpoint.
> I observe that when the data size and the number of processes increases,
> the runtime of BLCR is very small compared to the overall checkpoint
> overhead in Open MPI. Is it because of the increase of coordination time for
> checkpoint? And what is included in the overall checkpoint overhead besides
> the BLCR's checkpoint overhead and coordination time?
> Thank you.
>
> Best Regards,
> Nguyen Toan
>


[OMPI users] Question on checkpoint overhead in Open MPI

2010-07-07 Thread Nguyen Toan
Hello everyone,
I have a question concerning the checkpoint overhead in Open MPI, which is
the difference taken from the runtime of application execution with and
without checkpoint.
I observe that when the data size and the number of processes increase, the
runtime of BLCR is very small compared to the overall checkpoint overhead in
Open MPI. Is it because of the increase in coordination time for the checkpoint?
And what is included in the overall checkpoint overhead besides the BLCR's
checkpoint overhead and coordination time?
Thank you.

Best Regards,
Nguyen Toan


[OMPI users] How to checkpoint atomic function in OpenMPI

2010-06-14 Thread Nguyen Toan
Hi all,
I have a MPI program as follows:
---
int main(){
   MPI_Init();
   ..
   for (i=0; i<1; i++) {
  my_atomic_func();
   }
   ...
   MPI_Finalize();
   return 0;
}


The runtime of this program is mostly spent in the loop, and
my_atomic_func() takes a fairly long time.
Here I want my_atomic_func() to execute atomically, but the checkpoint
(taken by running the ompi-checkpoint command) may happen in the middle of a
my_atomic_func() call, and hence ompi-restart may fail to restart
correctly.

So my question is:
+ At the checkpoint time (executing ompi-checkpoint), is there a way to let
OpenMPI wait until my_atomic_func()  finishes its operation?
+ How does ompi-checkpoint operate to checkpoint MPI threads?

Regards,
Nguyen Toan


Re: [OMPI users] ompi-restart failed

2010-06-14 Thread Nguyen Toan
Hi all,
I finally figured out the answer. I just added the parameter "-machinefile
host" to the "ompi-restart" command and it restarted correctly. So is OpenMPI
unable to restart a multi-threaded application on a single node?

Nguyen Toan

On Tue, Jun 8, 2010 at 12:07 AM, Nguyen Toan <nguyentoan1...@gmail.com>wrote:

> Sorry, I just want to add 2 more things:
> + I tried configuring with and without --enable-ft-thread, but nothing changed
> + I also applied this patch for OpenMPI and reinstalled, but I got the
> same error:
>
> https://svn.open-mpi.org/trac/ompi/raw-attachment/ticket/2139/v1.4-preload-part1.diff
>
> Can somebody help? Thank you very much.
>
> Nguyen Toan
>
>
> On Mon, Jun 7, 2010 at 11:51 PM, Nguyen Toan <nguyentoan1...@gmail.com>wrote:
>
>> Hello everyone,
>>
>> I'm using OpenMPI 1.4.2 with BLCR 0.8.2 to test checkpointing on 2 nodes
>> but it failed to restart (Segmentation fault).
>> Here are the details concerning my problem:
>>
>> + OS: Centos 5.4
>> + OpenMPI configure:
>> ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
>> --with-blcr=/home/nguyen/opt/blcr
>> --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
>> --prefix=/home/nguyen/opt/openmpi \
>> --enable-mpirun-prefix-by-default
>> + mpirun -am ft-enable-cr -machinefile host ./test
>>
>> I checkpointed the test program using "ompi-checkpoint -v -s PID" and the
>> checkpoint file was created successfully. However it failed to restart using
>> ompi-restart:
>> *"mpirun noticed that process rank 0 with PID 21242 on node rc014.local
>> exited on signal 11 (Segmentation fault)"
>> *
>> Did I miss something in the installation of OpenMPI?
>>
>> Regards,
>> Nguyen Toan
>>
>
>


Re: [OMPI users] ompi-restart failed

2010-06-07 Thread Nguyen Toan
Sorry, I just want to add 2 more things:
+ I tried configuring with and without --enable-ft-thread, but nothing changed
+ I also applied this patch for OpenMPI and reinstalled, but I got the
same error:
https://svn.open-mpi.org/trac/ompi/raw-attachment/ticket/2139/v1.4-preload-part1.diff

Can somebody help? Thank you very much.

Nguyen Toan

On Mon, Jun 7, 2010 at 11:51 PM, Nguyen Toan <nguyentoan1...@gmail.com>wrote:

> Hello everyone,
>
> I'm using OpenMPI 1.4.2 with BLCR 0.8.2 to test checkpointing on 2 nodes
> but it failed to restart (Segmentation fault).
> Here are the details concerning my problem:
>
> + OS: Centos 5.4
> + OpenMPI configure:
> ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
> --with-blcr=/home/nguyen/opt/blcr
> --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> --prefix=/home/nguyen/opt/openmpi \
> --enable-mpirun-prefix-by-default
> + mpirun -am ft-enable-cr -machinefile host ./test
>
> I checkpointed the test program using "ompi-checkpoint -v -s PID" and the
> checkpoint file was created successfully. However it failed to restart using
> ompi-restart:
> *"mpirun noticed that process rank 0 with PID 21242 on node rc014.local
> exited on signal 11 (Segmentation fault)"
> *
> Did I miss something in the installation of OpenMPI?
>
> Regards,
> Nguyen Toan
>


[OMPI users] ompi-restart failed

2010-06-07 Thread Nguyen Toan
Hello everyone,

I'm using OpenMPI 1.4.2 with BLCR 0.8.2 to test checkpointing on 2 nodes but
it failed to restart (Segmentation fault).
Here are the details concerning my problem:

+ OS: Centos 5.4
+ OpenMPI configure:
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
--with-blcr=/home/nguyen/opt/blcr
--with-blcr-libdir=/home/nguyen/opt/blcr/lib \
--prefix=/home/nguyen/opt/openmpi \
--enable-mpirun-prefix-by-default
+ mpirun -am ft-enable-cr -machinefile host ./test

I checkpointed the test program using "ompi-checkpoint -v -s PID" and the
checkpoint file was created successfully. However it failed to restart using
ompi-restart:
*"mpirun noticed that process rank 0 with PID 21242 on node rc014.local
exited on signal 11 (Segmentation fault)"
*
Did I miss something in the installation of OpenMPI?

Regards,
Nguyen Toan
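Pulling together the working sequence from this thread and the follow-ups above, a sketch of the full checkpoint/restart cycle (the PID is whatever mpirun reports and is a placeholder here):

# Terminal 1: launch with C/R enabled, using the machinefile "host"
mpirun -am ft-enable-cr -machinefile host ./test

# Terminal 2: checkpoint the job by the PID of mpirun
ompi-checkpoint -v -s <PID_of_mpirun>

# Restart from the resulting global snapshot, again passing the machinefile
ompi-restart -machinefile host ompi_global_snapshot_<PID_of_mpirun>.ckpt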


Re: [OMPI users] OpenMPI Checkpoint/Restart is failed

2010-05-24 Thread Nguyen Toan
Hi all,

I had the same problem as Jitsumoto, i.e. OpenMPI 1.4.2 failed to restart
and the patch which Fernando gave didn't work.
I also tried the 1.5 nightly snapshots, but they did not seem to work well.
For my purposes I don't want to use --enable-ft-thread in configure, but
the same error occurred even when --enable-ft-thread was used.
Here is my configure for OMPI 1.5a1r23135:

>./configure \
>--with-ft=cr \
>--enable-mpi-threads \
>--with-blcr=/home/nguyen/opt/blcr
--with-blcr-libdir=/home/nguyen/opt/blcr/lib \
>--prefix=/home/nguyen/opt/openmpi_1.5 --enable-mpirun-prefix-by-default \

and errors:

>$ mpirun -am ft-enable-cr -machinefile ./host ./a.out
>0
>0
>1
>1
>2
>2
>3
>3
>--
>mpirun has exited due to process rank 1 with PID 6582 on
>node rc014 exiting improperly. There are two reasons this could occur:

>1. this process did not call "init" before exiting, but others in
>the job did. This can cause a job to hang indefinitely while it waits
>for all processes to call "init". By rule, if one process calls "init",
>then ALL processes must call "init" prior to termination.

>2. this process called "init", but exited without calling "finalize".
>By rule, all processes that call "init" MUST call "finalize" prior to
>exiting or it will be considered an "abnormal termination"

>This may have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>---

And here is the checkpoint command:

>$ ompi-checkpoint -s -v --term 10982
>[rc013.local:11001] [  0.00 /   0.14] Requested - ...
>[rc013.local:11001] [  0.00 /   0.14]   Pending - ...
>[rc013.local:11001] [  0.01 /   0.15]   Running - ...
>[rc013.local:11001] [  7.79 /   7.94]  Finished -
>ompi_global_snapshot_10982.ckpt
>Snapshot Ref.:   0 ompi_global_snapshot_10982.ckpt

I also took a look inside the checkpoint files and found that the snapshot was
taken:
~/tmp/ckpt/ompi_global_snapshot_10982.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.6582

But restarting failed as follows:
>$ ompi-restart ompi_global_snapshot_10982.ckpt
>--
>mpirun noticed that process rank 1 with PID 11346 on node rc013.local
exited >on signal 11 (Segmentation fault).
>--

Does anyone have any idea about this? Thank you!

Regards,
Nguyen Toan


On Mon, May 24, 2010 at 4:08 PM, Hideyuki Jitsumoto <
jitum...@gsic.titech.ac.jp> wrote:

> -- Forwarded message --
> From: Fernando Lemos <fernando...@gmail.com>
> Date: Thu, Apr 15, 2010 at 2:18 AM
> Subject: Re: [OMPI users] OpenMPI Checkpoint/Restart is failed
> To: Open MPI Users <us...@open-mpi.org>
>
>
> On Wed, Apr 14, 2010 at 5:25 AM, Hideyuki Jitsumoto
> <hjitsum...@gmail.com> wrote:
> > Fernando,
> >
> > Thank you for your reply.
> > I tried to patch the file you mentioned, but the output did not change.
>
> I didn't test the patch, tbh. I'm using 1.5 nightly snapshots, and it
> works great.
>
> >>Are you using a shared file system? You need to use a shared file
> > system for checkpointing with 1.4.1:
> > What is the shared file system ? do you mean NFS, Lustre and so on ?
> > (I'm sorry about my ignorance...)
>
> Something like NFS, yea.
>
> > If I use only one node for application, do I need such a
> shared-file-system ?
>
> No, for a single node, checkpointing with 1.4.1 should work (it works
> for me, at least). If you're using a single node, then your problem is
> probably not related to the bug report I posted.
>
>
> Regards,
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> --
> Sincerely Yours,
> Hideyuki Jitsumoto (jitum...@gsic.titech.ac.jp)
> Tokyo Institute of Technology
> Global Scientific Information and Computing center (Matsuoka Lab.)
>