Hello Josh,

Thank you again for your response. I tried checkpointing a simple C program using BLCR and got the same error, i.e.:
- vfs_write returned -14
- file_header: write returned -14
Checkpoint failed: Bad address

This is how I installed BLCR and Open MPI and ran MPI programs for checkpointing:

1) Configure and install BLCR:

   tar zxf blcr-<VERSION>.tar.gz
   cd blcr-<VERSION>
   mkdir builddir
   cd builddir
   ../configure --prefix=/usr/local/ --enable-debug=yes --enable-libcr-tracing=yes --enable-kernel-tracing=yes --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
   make
   make install

2) Configure and install Open MPI:

   ./configure --prefix=/usr/local/ --enable-picky --enable-debug --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries --enable-trace --enable-static=yes --enable-debug --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/ --with-blcr-libdir=/usr/local/lib --enable-mpi-threads=yes
   make all install

3) Compile and run the MPI program as follows:

   raj> mpicc helloworld.c -o helloworld
   raj> mpirun -am ft-enable-cr helloworld

4) To checkpoint the running program:

   raj> ompi-checkpoint [any option] pid

   for example:

   raj> ompi-checkpoint -v 11527

5) To restart from the checkpoint, locate the checkpoint file and type the following on the command line:

   raj> ompi-restart ompi_global_snapshot_XXXX.ckpt

I then did another test with BLCR alone. This time I checkpointed my C application from the /tmp directory instead of my $HOME directory, and it checkpointed fine. So it looks like the problem is with my $HOME directory. I have "drwx" rights on my $HOME directory, which seems fine to me.

Then I tried the same thing with Open MPI. However, with Open MPI the checkpoint file automatically gets saved in the $HOME directory. Is there a way to have the file saved in a different location? I saw that LAM/MPI has a command line option for this:

   $ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out

Do we have a similar option for Open MPI?
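For anyone who wants to compare the two directories the same way, here is a minimal shell sketch of the check. The function name probe_dir is a made-up helper, not part of BLCR or Open MPI. It only tests an ordinary user-level write, so it is an assumption on my part that it says anything about the checkpoint failure: the -14 ("Bad address") comes from a write issued inside the kernel, and a directory can pass this probe and still fail there.

```shell
#!/bin/sh
# probe_dir: hypothetical helper, not part of BLCR or Open MPI.
# Reports whether a directory accepts an ordinary user-level write.
probe_dir() {
    dir=$1
    # mktemp creates a uniquely named file inside "$dir", or fails.
    probe=$(mktemp "$dir/ckpt_probe.XXXXXX" 2>/dev/null) || {
        echo "$dir: cannot create files"
        return 1
    }
    # Write a little data, then remove the probe file again.
    if printf 'probe\n' > "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "$dir: writable"
        return 0
    fi
    rm -f "$probe"
    echo "$dir: file created but write failed"
    return 1
}
```

After sourcing the file, running probe_dir "$HOME" and probe_dir /tmp shows whether the two directories differ for plain writes. If both report "writable", permissions are probably not the cause and the filesystem type of $HOME would be the next thing to look at.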

Thanks a lot.

Regards,
Raj

--- On Wed, 6/17/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

> From: Josh Hursey <jjhur...@open-mpi.org>
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Wednesday, June 17, 2009, 1:42 AM
>
> Did you try checkpointing a non-MPI application with BLCR on the
> cluster? If that does not work then I would suspect that BLCR is not
> working properly on the system.
>
> However if a non-MPI application can be checkpointed and restarted
> correctly on this machine then it may be something odd with the Open
> MPI installation or runtime environment. To help debug here I would
> need to know how Open MPI was configured and how the application was
> ran on the machine (command line arguments, environment
> variables, ...).
>
> I should note that for the program that you sent it is important that
> you compile Open MPI with the Fault Tolerance Thread enabled to ensure
> a timely checkpoint. Otherwise the checkpoint will be delayed until
> the MPI program enters the MPI_Finalize function.
>
> Let me know what you find out.
>
> Josh
>
> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
>
> >
> > Hi Josh,
> >
> > Thanks for the email. I have install BLCR 0.8.1 and openmpi 1.3 on
> > my laptop with Ubuntu 8.04 on it. It works fine.
> >
> > I now tried the installation on the cluster ( on one machine for
> > now) in my university. ( the administrator installed it) i am not
> > sure if he followed the steps i gave him.
> >
> > I am checkpointing a simple mpi application which looks as follows:
> >
> > #include <mpi.h>
> > #include <stdio.h>
> >
> > int main(int argc, char **argv)
> > {
> >     int rank,size;
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("bye \n");
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > Do you think its better to re install BLCR?
> >
> >
> > Thanks
> >
> > Raj
> > --- On Tue, 6/16/09, Josh Hursey <jjhur...@open-mpi.org> wrote:
> >
> >> From: Josh Hursey <jjhur...@open-mpi.org>
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users" <us...@open-mpi.org>
> >> Date: Tuesday, June 16, 2009, 6:42 PM
> >>
> >> These are errors from BLCR. It may be a problem with your
> >> BLCR installation and/or your application. Are you able to
> >> checkpoint/restart a non-MPI application with BLCR on these
> >> machines?
> >>
> >> What kind of MPI application are you trying to checkpoint?
> >> Some of the MPI interfaces are not fully supported at the
> >> moment (outlined in the FT User Document that I mentioned in
> >> a previous email).
> >>
> >> -- Josh
> >>
> >> On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
> >>
> >>>
> >>> Dear All,
> >>> I have install openmpi 1.3 and blcr 0.8.1 on a linux machine
> >>> (ubuntu). however, when i try checkpointing an MPI application,
> >>> I get the following error:
> >>>
> >>> - vfs_write returned -14
> >>> - file_header: write returned -14
> >>>
> >>> Can someone help please.
> >>>
> >>> Regards,
> >>>
> >>> Raj
> >>>
> >>> _______________________________________________
> >>> users mailing list
> >>> us...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/users