Hello Josh,

Thank you again for your response. I tried checkpointing a simple C
program using BLCR alone and got the same error, i.e.:

- vfs_write returned -14
- file_header: write returned -14
Checkpoint failed: Bad address
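
As a side note on reading these numbers: kernel routines such as vfs_write
report failure as a negated errno value, so -14 corresponds to errno 14,
which on Linux is EFAULT, the same "Bad address" shown in the last line
above. A minimal sketch of that mapping follows; the helper names
kernel_rc_message and is_bad_address are made up for illustration and are
not part of BLCR or Open MPI.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: turn a kernel-style return code (a negated errno
 * value on failure) into the usual errno message. Zero or a positive
 * value is treated as success. */
const char *kernel_rc_message(long rc)
{
    if (rc >= 0)
        return "no error";
    return strerror((int)(-rc));
}

/* Hypothetical helper: returns 1 if rc is the negated EFAULT code
 * ("Bad address"), 0 otherwise. */
int is_bad_address(long rc)
{
    return rc == -EFAULT;
}
```

Calling kernel_rc_message(-14) from a small main() on a Linux machine is
enough to confirm which errno the BLCR messages refer to.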


This is how I installed BLCR and Open MPI, and how I run MPI programs for
checkpointing:

1) Configure and install BLCR

tar zxf blcr-<VERSION>.tar.gz
cd blcr-<VERSION>
mkdir builddir
cd builddir

../configure --prefix=/usr/local/ --enable-debug=yes --enable-libcr-tracing=yes \
    --enable-kernel-tracing=yes --enable-testsuite=yes --enable-all-static=yes \
    --enable-static=yes

make
make install

2) Configure and install Open MPI

./configure --prefix=/usr/local/ --enable-picky --enable-debug \
    --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace \
    --enable-binaries --enable-trace --enable-static=yes --enable-debug \
    --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr \
    --enable-ft-thread --with-blcr=/usr/local/ --with-blcr-libdir=/usr/local/lib \
    --enable-mpi-threads=yes

make all install

3) Compile and run the MPI program as follows:

     raj> mpicc helloworld.c -o helloworld
     raj> mpirun -am ft-enable-cr helloworld

4) To checkpoint the running program, pass the PID of mpirun to
ompi-checkpoint:

         raj> ompi-checkpoint [options] <PID of mpirun>
         for example:   ompi-checkpoint -v 11527

5) To restart from the checkpoint, locate the checkpoint file and type the
following on the command line:

          raj> ompi-restart ompi_global_snapshot_XXXX.ckpt


I then did another test with BLCR alone.

I tried checkpointing my C application from the /tmp directory instead of my
$HOME directory, and it checkpointed fine.

So it looks like the problem is with my $HOME directory.

I have "drwx" permissions on my $HOME directory, which seems fine to me.
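
Since the same checkpoint works from /tmp, one thing that may be worth
checking is whether $HOME sits on a different kind of filesystem than /tmp.
On a cluster it could be NFS, for example, but that is only a guess on my
part and nothing above confirms it. Below is a minimal, Linux-only sketch of
such a check. The helper names dir_is_writable and dir_is_on_nfs are
hypothetical, and NFS_MAGIC is a local copy of NFS_SUPER_MAGIC from
<linux/magic.h>.

```c
#include <sys/vfs.h>   /* statfs(), Linux-specific */
#include <unistd.h>    /* access() */

/* Local copy of NFS_SUPER_MAGIC from <linux/magic.h>. */
#define NFS_MAGIC 0x6969L

/* Hypothetical helper: returns 1 if the calling user may write to path,
 * 0 otherwise (including when path does not exist). */
int dir_is_writable(const char *path)
{
    return access(path, W_OK) == 0;
}

/* Hypothetical helper: returns 1 if path lives on an NFS mount, 0 if it
 * lives on some other filesystem, and -1 if statfs() fails. */
int dir_is_on_nfs(const char *path)
{
    struct statfs fs;

    if (statfs(path, &fs) != 0)
        return -1;
    return fs.f_type == NFS_MAGIC;
}
```

Comparing the results for $HOME and for /tmp would show whether the two
directories differ in anything other than their permission bits.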

Then I tried it with Open MPI. However, with Open MPI the checkpoint file
is automatically saved in the $HOME directory.

Is there a way to have the file saved in a different location? I saw that
LAM/MPI has a command-line option for this:

$ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out

Do we have a similar option for Open MPI?

Thanks a lot

regards,

Raj

--- On Wed, 6/17/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

> From: Josh Hursey <jjhur...@open-mpi.org>
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Wednesday, June 17, 2009, 1:42 AM
> Did you try checkpointing a non-MPI application with BLCR on the
> cluster? If that does not work then I would suspect that BLCR is not
> working properly on the system.
>
> However, if a non-MPI application can be checkpointed and restarted
> correctly on this machine then it may be something odd with the Open
> MPI installation or runtime environment. To help debug here I would
> need to know how Open MPI was configured and how the application was
> run on the machine (command line arguments, environment variables, ...).
>
> I should note that for the program that you sent it is important that
> you compile Open MPI with the Fault Tolerance Thread enabled to ensure
> a timely checkpoint. Otherwise the checkpoint will be delayed until
> the MPI program enters the MPI_Finalize function.
> 
> Let me know what you find out.
> 
> Josh
> 
> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
> 
> >
> > Hi Josh,
> >
> > Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3 on
> > my laptop with Ubuntu 8.04 on it. It works fine.
> >
> > I have now tried the installation on the cluster (on one machine for
> > now) at my university. The administrator installed it; I am not
> > sure if he followed the steps I gave him.
> >
> > I am checkpointing a simple MPI application which looks as follows:
> >
> > #include <mpi.h>
> > #include <stdio.h>
> > #include <stdlib.h>   /* system() */
> >
> > int main(int argc, char **argv)
> > {
> >     int rank, size;
> >
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >
> >     /* Print, then sleep, three times so there is time to checkpoint. */
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >
> >     printf("bye \n");
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > Do you think it's better to reinstall BLCR?
> >
> >
> > Thanks
> >
> > Raj
> > --- On Tue, 6/16/09, Josh Hursey <jjhur...@open-mpi.org> wrote:
> >
> >> From: Josh Hursey <jjhur...@open-mpi.org>
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users" <us...@open-mpi.org>
> >> Date: Tuesday, June 16, 2009, 6:42 PM
> >>
> >> These are errors from BLCR. It may be a problem with your
> >> BLCR installation and/or your application. Are you able to
> >> checkpoint/restart a non-MPI application with BLCR on these
> >> machines?
> >>
> >> What kind of MPI application are you trying to checkpoint?
> >> Some of the MPI interfaces are not fully supported at the
> >> moment (outlined in the FT User Document that I mentioned in
> >> a previous email).
> >>
> >> -- Josh
> >>
> >> On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
> >>
> >>>
> >>> Dear All,
> >>>
> >>> I have installed Open MPI 1.3 and BLCR 0.8.1 on a Linux machine
> >>> (Ubuntu). However, when I try checkpointing an MPI application, I get
> >>> the following error:
> >>>
> >>> - vfs_write returned -14
> >>> - file_header: write returned -14
> >>>
> >>> Can someone help please.
> >>>
> >>> Regards,
> >>>
> >>> Raj
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> _______________________________________________
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>
> >>
> >
> >
> >
> >
> 
> 



