Hi Josh,

Thank you for the email. I can now checkpoint the application on the cluster using Open MPI, but I am now facing another problem.
When I tried restarting the checkpoint, nothing happened. I copied the checkpoint file to the $HOME directory, tried restarting it there, and got the following error:

- open('/var/cache/nscd/passwd', 0x0) failed: -13
- mmap failed: /var/cache/nscd/passwd
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
Restart failed: Permission denied

On my laptop it works fine, so I am assuming it again has something to do with my $HOME directory.

Is it possible to restart the checkpoint from the /tmp directory itself, without having to copy it back to the $HOME directory? Is there another way to compile and build Open MPI so that everything happens in the /tmp directory instead of the $HOME directory?

Thank you,
Raj

--- On Fri, 6/19/09, Josh Hursey <jjhur...@open-mpi.org> wrote:

> From: Josh Hursey <jjhur...@open-mpi.org>
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users" <us...@open-mpi.org>
> Date: Friday, June 19, 2009, 2:48 PM
>
> On Jun 18, 2009, at 7:33 PM, Kritiraj Sajadah wrote:
>
> > Hello Josh,
> > Thank you again for your response. I tried checkpointing a
> > simple C program using BLCR and got the same error, i.e.:
> >
> > - vfs_write returned -14
> > - file_header: write returned -14
> > Checkpoint failed: Bad address
>
> So I would look at how your NFS file system is set up, and work with
> your sysadmin (and maybe the BLCR list) to resolve this before
> experimenting too much with checkpointing with Open MPI.
>
> > This is how I installed and ran MPI programs for checkpointing:
> >
> > 1) configure and install blcr
> > 2) configure and install openmpi
> > 3) Compile and run mpi program as follows:
> > 4) To checkpoint the running program,
> > 5) To restart your checkpoint, locate the checkpoint file and type
> >    the following from the command line:
>
> This all looks ok to me.
>
> > I then did another test with BLCR, however.
> >
> > I tried checkpointing my C application from the /tmp directory
> > instead of my $HOME directory and it checkpointed fine.
> >
> > So it looks like the problem is with my $HOME directory.
> >
> > I have "drwx" rights on my $HOME directory, which seems fine to me.
> >
> > Then I tried it with Open MPI. However, with Open MPI the
> > checkpoint file automatically gets saved in the $HOME directory.
> >
> > Is there a way to have the file saved in a different location? I
> > checked that LAM/MPI has some command line options:
> >
> > $ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out
> >
> > Do we have a similar option for Open MPI?
>
> By default Open MPI places the global snapshot in the $HOME directory.
> But you can also specify a different directory for the global snapshot
> using the following MCA option:
>   -mca snapc_base_global_snapshot_dir /somewhere/else
>
> For the best results you will likely want to set this in the MCA
> params file in your home directory:
>   shell$ cat ~/.openmpi/mca-params.conf
>   snapc_base_global_snapshot_dir=/somewhere/else
>
> You can also stage the file to local disk, then have Open MPI transfer
> the checkpoints back to a {logically} central storage device (both can
> be /tmp on a local disk if you like). For more details on this and the
> above option you will want to read through the FT Users Guide attached
> to the wiki page at the link below:
>   https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>
> -- Josh
>
> > Thanks a lot
> >
> > regards,
> >
> > Raj
> >
> > --- On Wed, 6/17/09, Josh Hursey <jjhur...@open-mpi.org> wrote:
> >
> >> From: Josh Hursey <jjhur...@open-mpi.org>
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users" <us...@open-mpi.org>
> >> Date: Wednesday, June 17, 2009, 1:42 AM
> >>
> >> Did you try checkpointing a non-MPI application with BLCR on the
> >> cluster?
> >> If that does not work then I would suspect that BLCR is not
> >> working properly on the system.
> >>
> >> However, if a non-MPI application can be checkpointed and restarted
> >> correctly on this machine then it may be something odd with the Open
> >> MPI installation or runtime environment. To help debug here I would
> >> need to know how Open MPI was configured and how the application was
> >> run on the machine (command line arguments, environment
> >> variables, ...).
> >>
> >> I should note that for the program that you sent it is important that
> >> you compile Open MPI with the Fault Tolerance Thread enabled to ensure
> >> a timely checkpoint. Otherwise the checkpoint will be delayed until
> >> the MPI program enters the MPI_Finalize function.
> >>
> >> Let me know what you find out.
> >>
> >> Josh
> >>
> >> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
> >>
> >>> Hi Josh,
> >>>
> >>> Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3 on
> >>> my laptop with Ubuntu 8.04 on it. It works fine.
> >>>
> >>> I have now tried the installation on the cluster (on one machine for
> >>> now) at my university. The administrator installed it, and I am not
> >>> sure whether he followed the steps I gave him.
> >>>
> >>> I am checkpointing a simple MPI application which looks as follows:
> >>>
> >>> #include <mpi.h>
> >>> #include <stdio.h>
> >>> #include <stdlib.h>  /* declares system() */
> >>>
> >>> int main(int argc, char **argv)
> >>> {
> >>>     int rank, size;
> >>>
> >>>     MPI_Init(&argc, &argv);
> >>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >>>
> >>>     /* Print, then sleep, three times so there is time to checkpoint. */
> >>>     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >>>     system("sleep 30");
> >>>     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >>>     system("sleep 30");
> >>>     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >>>     system("sleep 30");
> >>>     printf("bye \n");
> >>>
> >>>     MPI_Finalize();
> >>>     return 0;
> >>> }
> >>>
> >>> Do you think it's better to reinstall BLCR?
> >>>
> >>> Thanks
> >>>
> >>> Raj
> >>>
> >>> --- On Tue, 6/16/09, Josh Hursey <jjhur...@open-mpi.org> wrote:
> >>>
> >>>> From: Josh Hursey <jjhur...@open-mpi.org>
> >>>> Subject: Re: [OMPI users] vfs_write returned -14
> >>>> To: "Open MPI Users" <us...@open-mpi.org>
> >>>> Date: Tuesday, June 16, 2009, 6:42 PM
> >>>>
> >>>> These are errors from BLCR. It may be a problem with your
> >>>> BLCR installation and/or your application. Are you able to
> >>>> checkpoint/restart a non-MPI application with BLCR on these
> >>>> machines?
> >>>>
> >>>> What kind of MPI application are you trying to checkpoint?
> >>>> Some of the MPI interfaces are not fully supported at the
> >>>> moment (outlined in the FT User Document that I mentioned in
> >>>> a previous email).
> >>>>
> >>>> -- Josh
> >>>>
> >>>> On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
> >>>>
> >>>>> Dear All,
> >>>>>
> >>>>> I have installed Open MPI 1.3 and BLCR 0.8.1 on a Linux machine
> >>>>> (Ubuntu).
> >>>>> However, when I try checkpointing an MPI application, I get
> >>>>> the following error:
> >>>>>
> >>>>> - vfs_write returned -14
> >>>>> - file_header: write returned -14
> >>>>>
> >>>>> Can someone help, please?
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> Raj
> >>>>>
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> us...@open-mpi.org
> >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users