[OMPI users] Fw: Problem with checkpointing multihosts, multiprocesses MPI application
Hi everyone,

Happy New Year 2010. A few weeks ago I posted a query (please see the email below) about checkpointing applications running on multiple hosts. I am still struggling to find a solution and would really appreciate it if someone could help.

Thank you.
Raj

--- On Sat, 12/12/09, Kritiraj Sajadah wrote:
> From: Kritiraj Sajadah
> Subject: Problem with checkpointing multihosts, multiprocesses MPI application
> To: us...@open-mpi.org
> Date: Saturday, December 12, 2009, 3:03 PM
>
> Dear All,
> I am trying to checkpoint an MPI application which has two processes, each running on a separate host.
>
> I run the application as follows:
>
>     raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m
>
> and I trigger the checkpoint as follows:
>
>     raj@sun32:~$ ompi-checkpoint -v 30010
>
> The following happens, with two errors displayed while checkpointing the application:
>
> ##
> I am processor no 0 of a total of 2 procs on host sun32
> I am processor no 1 of a total of 2 procs on host sun06
> I am processo no 0 of a total of 2 procs on host sun32
> I am processo no 1 of a total of 2 procs on host sun06
>
> [sun32:30010] Error: expected_component: PID information unavailable!
> [sun32:30010] Error: expected_component: Component Name information unavailable!
>
> I am proceor no 1 of a total of 2 procs on host sun06
> I am proceor no 0 of a total of 2 procs on host sun32
> bye
> bye
>
> When I try to restart the application from the checkpoint file, I get the following:
>
>     raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
> --
> Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
>        or provided an invalid filename.
>        Please see --help for usage.
> --
> I am proceor no 0 of a total of 2 procs on host sun32
> bye
>
> I would very much appreciate it if you could give me some ideas on how to checkpoint and restart an MPI application running on multiple hosts.
>
> Thank you
>
> Regards,
>
> Raj
[OMPI users] problem restarting multiprocess mpi application
Dear All,

I am running a simple MPI application which looks as follows:

##
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello\n");
    sleep(15);
    printf("Hello again\n");
    sleep(15);
    printf("Final Hello\n");
    sleep(15);
    printf("bye \n");
    MPI_Finalize();
    return 0;
}
#

I am running it as follows:

    mpirun -am ft-enable-cr -np 2 -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp mpisleepbas

The application checkpoints correctly. Once a checkpoint is taken, I copy it to the home directory and try to restart it there, which gives the following errors:

##
ompi-restart ompi_global_snapshot_380.ckpt
Hello again
[sun06:00381] *** Process received signal ***
[sun06:00381] Signal: Bus error (7)
[sun06:00381] Signal code: (2)
[sun06:00381] Failing at address: 0xae7cb054
[sun06:00381] [ 0] [0xb7f8640c]
[sun06:00381] [ 1] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_progress+0x123) [0xb7b95456]
[sun06:00381] [ 2] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcb093]
[sun06:00381] [ 3] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcae97]
[sun06:00381] [ 4] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x187) [0xb7bca69b]
[sun06:00381] [ 5] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_inc_core+0xc3) [0xb7b970bd]
[sun06:00381] [ 6] /home/raj/openmpisof/lib/libopen-rte.so.0 [0xb7cab06f]
[sun06:00381] [ 7] /home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x129) [0xb7b96fca]
[sun06:00381] [ 8] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7b97698]
[sun06:00381] [ 9] /lib/libpthread.so.0 [0xb7ac4f3b]
[sun06:00381] [10] /lib/libc.so.6(clone+0x5e) [0xb7a4bbee]
[sun06:00381] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 399 on node sun06 exited on signal 7 (Bus error).
--
#

Please note that if I use -np 1, the restart works fine. The problem appears only when the application has more than one process running.

Any help will be very much appreciated.

Raj
[OMPI users] Problem with checkpointing multihosts, multiprocesses MPI application
Dear All,

I am trying to checkpoint an MPI application which has two processes, each running on a separate host.

I run the application as follows:

    raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir /tmp m

and I trigger the checkpoint as follows:

    raj@sun32:~$ ompi-checkpoint -v 30010

The following happens, with two errors displayed while checkpointing the application:

##
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06
I am processo no 0 of a total of 2 procs on host sun32
I am processo no 1 of a total of 2 procs on host sun06

[sun32:30010] Error: expected_component: PID information unavailable!
[sun32:30010] Error: expected_component: Component Name information unavailable!

I am proceor no 1 of a total of 2 procs on host sun06
I am proceor no 0 of a total of 2 procs on host sun32
bye
bye

When I try to restart the application from the checkpoint file, I get the following:

    raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
       or provided an invalid filename.
       Please see --help for usage.
--
I am proceor no 0 of a total of 2 procs on host sun32
bye

I would very much appreciate it if you could give me some ideas on how to checkpoint and restart an MPI application running on multiple hosts.

Thank you

Regards,
Raj
[OMPI users] a good grid simulator to run open MPI applications
Hi All,

Can you recommend a good open source grid simulation tool for running Open MPI applications?

Thanks
Raj
[OMPI users] get the process Id of mpirun
Dear All,

I am trying to get the process ID of mpirun from within my MPI application. When I use getpid() and getppid(), I get the PID of my application and the PID of "orted --daemonize -mca..." respectively.

Is there a way to get the PID of mpirun? In this case, it looks like it is the grandparent of the application.

Thank you

Regards,
Raj
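For reference, the "grandparent" idea in the question can be sketched on Linux by reading the parent PID of an arbitrary process from /proc/<pid>/stat. This is only a sketch under stated assumptions: the helper name parent_of() is made up for illustration, it relies on a Linux-style /proc, and it only finds mpirun if mpirun really is an ancestor on the same host. A daemonized orted is typically reparented, so the lookup may return init rather than mpirun, and on a remote host mpirun is not a local process at all.

```c
/* Sketch: look up the parent of any PID by parsing /proc/<pid>/stat.
 * Assumes Linux; parent_of() is a hypothetical helper, not an Open MPI API. */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the parent PID of `pid`, or -1 if it cannot be determined. */
pid_t parent_of(pid_t pid)
{
    char path[64], buf[1024];
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';

    /* Layout is "pid (comm) state ppid ...". The command name may contain
     * spaces or ')', so continue parsing after the LAST ')'. */
    char *p = strrchr(buf, ')');
    if (p == NULL)
        return -1;

    char state;
    long ppid;
    if (sscanf(p + 1, " %c %ld", &state, &ppid) != 2)
        return -1;
    return (pid_t)ppid;
}

#ifdef PPID_DEMO_MAIN   /* build with -DPPID_DEMO_MAIN to get a small demo */
int main(void)
{
    pid_t parent = getppid();
    assert(parent_of(getpid()) == parent);   /* must agree with getppid() */
    printf("self=%ld parent=%ld grandparent=%ld\n",
           (long)getpid(), (long)parent, (long)parent_of(parent));
    return 0;
}
#endif
```

Inside an MPI program, parent_of(getppid()) would be called after MPI_Init; whether the result is mpirun depends on how the runtime launched the process.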
[OMPI users] mpirun noticed that process rank 1 ... exited on signal 13 (Broken pipe).
Hi Everyone,

I have installed Open MPI 1.3 and BLCR 0.8.1 on my laptop (single processor). I am trying to checkpoint a small test application:

###
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("mpisleep bye \n");
    MPI_Finalize();
    return 0;
}
###

I compile it as follows:

    mpicc mpisleep.c -o mpisleep

and I run it as follows:

    mpirun -am ft-enable-cr -np 2 mpisleep

When I checkpoint it (ompi-checkpoint -v 8118), it checkpoints fine, but when I restart it, I get the following:

I am processor no 0 of a total of 2 procs
I am processor no 1 of a total of 2 procs
mpisleep bye
--
mpirun noticed that process rank 1 with PID 8118 on node raj-laptop exited on signal 13 (Broken pipe).
--

Any suggestions are very much appreciated.

Raj
[OMPI users] problem using openmpi with DMTCP
Dear All,

I am trying to integrate DMTCP with Open MPI. If I run a plain C application, it works fine. But when I execute the program using mpirun, DMTCP checkpoints the application but gives errors when restarting it.

#
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
     id() = 2ab3f248-30933-4ac0d75a(99007)
     _sockDomain = 10
     _sockType = 1
     _sockProtocol = 0
Message: socket type not yet [fully] supported
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING((_sockDomain == AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
     id() = 2ab3f248-30943-4ac0d75c(99007)
     _sockDomain = 10
     _sockType = 1
     _sockProtocol = 0
Message: socket type not yet [fully] supported
[31013] WARNING at connection.cpp:87 in restartDup2; REASON='JWARNING(_real_dup2 ( oldFd, fd ) == fd) failed'
     oldFd = 537
     fd = 1
     (strerror((*__errno_location ( = Bad file descriptor
[31013] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
     i->second = 537
     (strerror((*__errno_location ( = Bad file descriptor
[31015] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
     i->second = 537
     (strerror((*__errno_location ( = Bad file descriptor
[31017] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
     i->second = 537
     (strerror((*__errno_location ( = Bad file descriptor
[31007] WARNING at connectionmanager.cpp:627 in closeAll; REASON='JWARNING(_real_close ( i->second ) ==0) failed'
     i->second = 537
     (strerror((*__errno_location ( = Bad file descriptor
MTCP: mtcp_restart_nolibc: mapping current version of /usr/lib/gconv/gconv-modules.cache into memory; _not_ file as it existed at time of checkpoint. Change mtcp_restart_nolibc.c:634 and re-compile, if you want different behavior.
[31015] ERROR at connection.cpp:372 in restoreOptions; REASON='JASSERT(ret == 0) failed'
     (strerror((*__errno_location ( = Invalid argument
     fds[0] = 6
     opt->first = 26
     opt->second.size() = 4
Message: restoring setsockopt failed
Terminating...
#

Any suggestions are very welcome.

Regards,
Raj
Re: [OMPI users] configure OPENMPI with DMTCP
Hi Josh,

I can't access the link you gave. It's a secure link and I think it needs authentication.

Thanks
Raj

--- On Thu, 8/13/09, Josh Hursey wrote:
> From: Josh Hursey
> Subject: Re: [OMPI users] configure OPENMPI with DMTCP
> To: "Open MPI Users"
> Date: Thursday, August 13, 2009, 2:40 PM
>
> On Aug 12, 2009, at 3:35 PM, Kritiraj Sajadah wrote:
>
> > HI,
> > I want to configure OPENMPI to checkpoint MPI applications using DMTCP. Does anyone know how to specify the path to the DMTCP application when installing OPENMPI.
>
> I have not experimented with Open MPI using DMTCP. If I understand their website and papers correctly, DMTCP can work with Open MPI without modification (though I do not know to what degree of coverage), so you -should- not need to specify anything when building Open MPI.
>
> > Also, I wanted to use OPENMPI with SELF instead of BLCR. Is there any guide for setting up OPENMPI with SELF?
>
> There are instructions for this in the Checkpoint/Restart User's Guide posted to the Open MPI wiki:
> https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>
> -- Josh
>
> > Thanks a lot.
> >
> > Regards,
> >
> > Raj
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
[OMPI users] configure OPENMPI with DMTCP
Hi,

I want to configure Open MPI to checkpoint MPI applications using DMTCP. Does anyone know how to specify the path to the DMTCP installation when building Open MPI?

Also, I would like to use Open MPI with SELF instead of BLCR. Is there a guide for setting up Open MPI with SELF?

Thanks a lot.

Regards,
Raj
Re: [OMPI users] Checkpointing automatically at regular intervals
Dear Josh,

I am sure it would be useful: someone using Open MPI to checkpoint an application will not want to sit and trigger every checkpoint manually, which is a real pain for a long-running application. I would imagine an automatic restart from the last checkpoint in case of failure would also be interesting.

Many thanks.

Regards,
Kritiraj

--- On Tue, 6/30/09, Josh Hursey wrote:
> From: Josh Hursey
> Subject: Re: [OMPI users] Checkpointing automatically at regular intervals
> To: "Open MPI Users"
> Date: Tuesday, June 30, 2009, 3:00 PM
>
> Currently, there is no mechanism to checkpoint every X minutes in Open MPI.
>
> As mentioned below you can use a script to initiate the checkpoint every X minutes. Alternatively it should not be too difficult to add such a feature to Open MPI. If enough people would be interested I can file a feature bug to add such a feature in a future release.
>
> Josh
>
> On Jun 30, 2009, at 9:34 AM, Mohamed Slim bouguerra wrote:
>
> > Hi,
> > I think that you can write a simple script such as:
> >
> > while [ "`pgrep mpirun`" != "" ]
> > do
> >     ompi-checkpoint `pidof mpirun`
> >     sleep 5
> > done
> >
> > On 30 June 09 at 14:29, Kritiraj Sajadah wrote:
> >
> >> Dear All,
> >> I can manually checkpoint an MPI application using OPEN MPI and BLCR. However, I now want to checkpoint my application automatically every 5 minutes. Is there a way in OPEN MPI to ensure automatic checkpointing without user intervention while the application is running?
> >>
> >> Thank you
> >>
> >> Regards,
> >> Kritiraj
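The script approach suggested in this thread can also be sketched as a small C driver, shown here only as an illustration under stated assumptions. The only external command it uses is ompi-checkpoint with the mpirun PID as its argument, as in the messages above; the function names process_alive(), build_checkpoint_cmd() and checkpoint_loop() are made up for this sketch and are not part of Open MPI.

```c
/* Sketch: run "ompi-checkpoint <pid>" every `interval` seconds until the
 * mpirun process exits. All function names here are hypothetical. */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Nonzero while a process with this PID exists (signal 0 only probes;
 * EPERM means the process exists but belongs to another user). */
int process_alive(pid_t pid)
{
    return kill(pid, 0) == 0 || errno == EPERM;
}

/* Writes "ompi-checkpoint <pid>" into buf. Returns 0, or -1 if it does not fit. */
int build_checkpoint_cmd(char *buf, size_t len, pid_t pid)
{
    int n = snprintf(buf, len, "ompi-checkpoint %ld", (long)pid);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* Checkpoints every `interval` seconds until mpirun is gone. */
int checkpoint_loop(pid_t mpirun_pid, unsigned interval)
{
    char cmd[64];
    assert(interval > 0);   /* a zero interval would checkpoint back to back */
    if (build_checkpoint_cmd(cmd, sizeof cmd, mpirun_pid) != 0)
        return -1;
    while (process_alive(mpirun_pid)) {
        sleep(interval);
        if (!process_alive(mpirun_pid))
            break;          /* job finished while we were sleeping */
        if (system(cmd) != 0)
            fprintf(stderr, "checkpoint command failed: %s\n", cmd);
    }
    return 0;
}

#ifdef CKPT_DEMO_MAIN   /* build with -DCKPT_DEMO_MAIN to get a command-line tool */
int main(int argc, char **argv)
{
    if (argc != 3 || atoi(argv[2]) <= 0) {
        fprintf(stderr, "usage: %s <mpirun-pid> <seconds>\n", argv[0]);
        return 1;
    }
    return checkpoint_loop((pid_t)atol(argv[1]), (unsigned)atoi(argv[2])) ? 1 : 0;
}
#endif
```

For the five-minute interval asked about in the thread, the demo would be started with the mpirun PID and 300 as arguments. Like the shell loop, it keeps every snapshot, so old snapshots would need to be cleaned up separately.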
Re: [OMPI users] Apllication level checkpointing tools.
Dear Mohamed,

Thank you for the link.

Regards,
Raj

--- On Tue, 6/30/09, Mohamed Slim bouguerra wrote:
> From: Mohamed Slim bouguerra
> Subject: Re: [OMPI users] Apllication level checkpointing tools.
> To: "Open MPI Users"
> Date: Tuesday, June 30, 2009, 1:09 PM
>
> Dear Kritiraj,
> You can use DMTCP http://sourceforge.net/projects/dmtcp
>
> On 30 June 09 at 13:59, Kritiraj Sajadah wrote:
>
> > Dear All,
> > I have successfully configured OPENMPI with BLCR and did some tests. However, I now want to do some testing with an application-level checkpointing tool. I tried using libckpt but could not install it.
> >
> > Does anyone know of any open source application-level checkpointing tools that I can install and test with openmpi?
> >
> > Thank you
> >
> > Regards,
> >
> > Raj
[OMPI users] Checkpointing automatically at regular intervals
Dear All,

I can manually checkpoint an MPI application using Open MPI and BLCR. However, I now want to checkpoint my application automatically every 5 minutes. Is there a way in Open MPI to ensure automatic checkpointing without user intervention while the application is running?

Thank you

Regards,
Kritiraj
[OMPI users] Apllication level checkpointing tools.
Dear All,

I have successfully configured Open MPI with BLCR and did some tests. However, I now want to do some testing with an application-level checkpointing tool. I tried using libckpt but could not install it.

Does anyone know of any open source application-level checkpointing tools that I can install and test with Open MPI?

Thank you

Regards,
Raj
Re: [OMPI users] vfs_write returned -14
Hi Josh,

Thank you for the email. I can now checkpoint the application on the cluster using Open MPI, but I am facing another problem.

When I tried restarting the checkpoint, nothing happened. I copied the checkpoint file to the $HOME directory, tried restarting it there, and got the following error:

- open('/var/cache/nscd/passwd', 0x0) failed: -13
- mmap failed: /var/cache/nscd/passwd
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
Restart failed: Permission denied

On my laptop it works fine, so I am assuming it is again something to do with my $HOME directory.

Is it possible to restart the checkpoint from the /tmp directory itself, without having to copy it back to the $HOME directory?

Is there another way to compile and build Open MPI so that everything happens in the /tmp directory instead of the $HOME directory?

Thank you
Raj

--- On Fri, 6/19/09, Josh Hursey wrote:
> From: Josh Hursey
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users"
> Date: Friday, June 19, 2009, 2:48 PM
>
> On Jun 18, 2009, at 7:33 PM, Kritiraj Sajadah wrote:
>
> > Hello Josh,
> > Thank you again for your response. I tried checkpointing a simple C program using BLCR... and got the same error, i.e.:
> >
> > - vfs_write returned -14
> > - file_header: write returned -14
> > Checkpoint failed: Bad address
>
> So I would look at how your NFS file system is setup, and work with your sysadmin (and maybe the BLCR list) to resolve this before experimenting too much with checkpointing with Open MPI.
>
> > This is how I installed and run MPI programs for checkpointing:
> >
> > 1) configure and install blcr
> > 2) configure and install openmpi
> > 3) Compile and run mpi program as follows:
> > 4) To checkpoint the running program,
> > 5) To restart your checkpoint, locate the checkpoint file and type the following from the command line:
>
> This all looks ok to me.
>
> > Then I did another test with BLCR, however.
> >
> > I tried checkpointing my C application from the /tmp directory instead of my $HOME directory and it checkpointed fine.
> >
> > So, it looks like the problem is with my $HOME directory.
> >
> > I have "drwx" rights on my $HOME directory, which seems fine to me.
> >
> > Then I tried it with Open MPI. However, with Open MPI the checkpoint file automatically gets saved in the $HOME directory.
> >
> > Is there a way to have the file saved in a different location? I checked that LAM/MPI has some command line options:
> >
> > $ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out
> >
> > Do we have a similar option for Open MPI?
>
> By default Open MPI places the global snapshot in the $HOME directory. But you can also specify a different directory for the global snapshot using the following MCA option:
>     -mca snapc_base_global_snapshot_dir /somewhere/else
>
> For the best results you will likely want to set this in the MCA params file in your home directory:
>     shell$ cat ~/.openmpi/mca-params.conf
>     snapc_base_global_snapshot_dir=/somewhere/else
>
> You can also stage the file to local disk, then have Open MPI transfer the checkpoints back to a {logically} central storage device (both can be /tmp on a local disk if you like). For more details on this and the above option you will want to read through the FT Users Guide attached to the wiki page at the link below:
>     https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
>
> -- Josh
>
> > Thanks a lot
> >
> > regards,
> >
> > Raj
> >
> > --- On Wed, 6/17/09, Josh Hursey wrote:
> >
> >> From: Josh Hursey
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users"
> >> Date: Wednesday, June 17, 2009, 1:42 AM
> >> Did you try checkpointing a non-MPI application with BLCR on the cluster? If that does not work then I would suspect that BLCR is not working properly on the system.
> >>
> >> However if a non-MPI application can be checkpointed and restarted correctly on this machine then it may be something odd with the Open MPI installation or runtime environment. To help debug here I woul
Re: [OMPI users] vfs_write returned -14
Hello Josh,

Thank you again for your response. I tried checkpointing a simple C program using BLCR and got the same error, i.e.:

- vfs_write returned -14
- file_header: write returned -14
Checkpoint failed: Bad address

This is how I installed and run MPI programs for checkpointing:

1) configure and install blcr

    tar zxf blcr-<version>.tar.gz
    cd blcr-<version>
    mkdir builddir
    cd builddir
    ../configure --prefix=/usr/local/ --enable-debug=yes --enable-libcr-tracing=yes --enable-kernel-tracing=yes --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
    make
    make install

2) configure and install openmpi

    ./configure --prefix=/usr/local/ --enable-picky --enable-debug --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries --enable-trace --enable-static=yes --enable-debug --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/ --with-blcr-libdir=/usr/local/lib --enable-mpi-threads=yes
    make all install

3) Compile and run the MPI program as follows:

    raj> mpicc helloworld.c -o helloworld
    raj> mpirun -am ft-enable-cr helloworld

4) To checkpoint the running program:

    raj> ompi-checkpoint [any option] pid

   for example: ompi-checkpoint -v 11527

5) To restart your checkpoint, locate the checkpoint file and type the following from the command line:

    raj> ompi-restart ompi_global_snapshot_<pid>.ckpt

Then I did another test with BLCR. I tried checkpointing my C application from the /tmp directory instead of my $HOME directory, and it checkpointed fine.

So it looks like the problem is with my $HOME directory. I have "drwx" rights on my $HOME directory, which seems fine to me.

Then I tried it with Open MPI. However, with Open MPI the checkpoint file automatically gets saved in the $HOME directory.

Is there a way to have the file saved in a different location? I checked that LAM/MPI has some command line options:

    $ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out

Do we have a similar option for Open MPI?

Thanks a lot

Regards,
Raj

--- On Wed, 6/17/09, Josh Hursey wrote:
> From: Josh Hursey
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users"
> Date: Wednesday, June 17, 2009, 1:42 AM
>
> Did you try checkpointing a non-MPI application with BLCR on the cluster? If that does not work then I would suspect that BLCR is not working properly on the system.
>
> However if a non-MPI application can be checkpointed and restarted correctly on this machine then it may be something odd with the Open MPI installation or runtime environment. To help debug here I would need to know how Open MPI was configured and how the application was run on the machine (command line arguments, environment variables, ...).
>
> I should note that for the program that you sent it is important that you compile Open MPI with the Fault Tolerance Thread enabled to ensure a timely checkpoint. Otherwise the checkpoint will be delayed until the MPI program enters the MPI_Finalize function.
>
> Let me know what you find out.
>
> Josh
>
> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
>
> > Hi Josh,
> >
> > Thanks for the email. I have installed BLCR 0.8.1 and openmpi 1.3 on my laptop with Ubuntu 8.04 on it. It works fine.
> >
> > I have now tried the installation on the cluster (on one machine for now) in my university. The administrator installed it; I am not sure if he followed the steps I gave him.
> >
> > I am checkpointing a simple MPI application which looks as follows:
> >
> > #include <mpi.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> >
> > int main(int argc, char **argv)
> > {
> >     int rank, size;
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("bye \n");
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > Do you think it's better to reinstall BLCR?
> >
> > Thanks
> >
> > Raj
> > --- On Tue, 6/16/09, Josh Hursey wrote:
> >
> >> From: Josh Hursey
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users"
Re: [OMPI users] vfs_write returned -14
Hi Josh,

Thanks for the email. I have installed BLCR 0.8.1 and Open MPI 1.3 on my laptop with Ubuntu 8.04 on it. It works fine.

I have now tried the installation on the cluster (on one machine for now) in my university. The administrator installed it; I am not sure if he followed the steps I gave him.

I am checkpointing a simple MPI application which looks as follows:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("bye \n");
    MPI_Finalize();
    return 0;
}

Do you think it's better to reinstall BLCR?

Thanks
Raj

--- On Tue, 6/16/09, Josh Hursey wrote:
> From: Josh Hursey
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users"
> Date: Tuesday, June 16, 2009, 6:42 PM
>
> These are errors from BLCR. It may be a problem with your BLCR installation and/or your application. Are you able to checkpoint/restart a non-MPI application with BLCR on these machines?
>
> What kind of MPI application are you trying to checkpoint? Some of the MPI interfaces are not fully supported at the moment (outlined in the FT User Document that I mentioned in a previous email).
>
> -- Josh
>
> On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
>
> > Dear All,
> > I have installed openmpi 1.3 and blcr 0.8.1 on a linux machine (ubuntu). However, when I try checkpointing an MPI application, I get the following error:
> >
> > - vfs_write returned -14
> > - file_header: write returned -14
> >
> > Can someone help please.
> >
> > Regards,
> >
> > Raj
[OMPI users] vfs_write returned -14
Dear All,

I have installed Open MPI 1.3 and BLCR 0.8.1 on a Linux machine (Ubuntu). However, when I try checkpointing an MPI application, I get the following error:

- vfs_write returned -14
- file_header: write returned -14

Can someone help, please?

Regards,
Raj
[OMPI users] Segmentation fault (11)
Dear All,

I have installed BLCR 0.8.1 and Open MPI 1.3 on a Linux platform. However, when I tried checkpointing an application, it hangs forever just before ending. A checkpoint file is generated, but when I try restarting from it, I get the following error:

    raj@sun06:~$ ompi-restart ompi_global_snapshot_22390.ckpt
[sun06:22423] *** Process received signal ***
[sun06:22423] Signal: Segmentation fault (11)
[sun06:22423] Signal code: Address not mapped (1)
[sun06:22423] Failing at address: (nil)
[sun06:22423] [ 0] [0xb7fb640c]
[sun06:22423] [ 1] /usr/local/openmpi/lib/libopen-pal.so.0(opal_crs_blcr_restart+0x103) [0xb7f76925]
[sun06:22423] [ 2] opal-restart [0x8049435]
[sun06:22423] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7d9a455]
[sun06:22423] [ 4] opal-restart [0x8049001]
[sun06:22423] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 22423 on node sun06 exited on signal 11 (Segmentation fault).
--

Any help will be very much appreciated.

Kind regards,
Raj
[OMPI users] Compiling and Building OPENMPI for checkpointing using self
Hi All,

I have successfully installed and configured Open MPI to perform checkpointing using the BLCR mechanism. However, I now want to try checkpointing using SELF. Has anyone done that? If so, I would very much appreciate it if you could send me the steps necessary to enable self checkpointing.

Many thanks.
Raj
Re: [OMPI users] *** An error occurred in MPI_Init
Hi Gus,

Thanks for your email. I have /usr/local/bin included in my $PATH (not /usr/local/include; that was just a copying mistake).

I checked where mpicc and mpirun are and got the following paths:

    /usr/local/bin/mpirun
    /usr/local/bin/mpicc

The BLCR I am using was downloaded and installed separately.

1) Do you think I may be using the wrong version of BLCR? There is a directory called blcr within the Open MPI tarball (openmpi-1.3/opal/mca/crs/blcr). Should I use this?

2) Do you think it's better to install Open MPI in /usr/local/openmpi and BLCR in /usr/local/blcr?

3) If so, how do I uninstall the one I have already?

Thank you
Kritiraj

--- On Fri, 5/8/09, Gus Correa wrote:
> From: Gus Correa
> Subject: Re: [OMPI users] *** An error occurred in MPI_Init
> To: "Open MPI Users"
> Date: Friday, May 8, 2009, 6:33 PM
>
> PS - Kritiraj
>
> Reading your message more carefully, I saw that you did this:
>
> > Open the $HOME/.bashrc and added the following:
> >
> > PATH="/usr/local/include:$PATH"
> > LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
>
> However, this is what you should have done:
>
> > Open the $HOME/.bashrc and added the following:
> >
> > PATH="/usr/local/bin:$PATH"
> > LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
>
> Note that /usr/local/bin, not /usr/local/include, should be pre-pended to your PATH!
>
> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
>
> Gus Correa wrote:
> > Hi Kritiraj
> >
> > This looks like many other errors reported on this list that are caused by using the wrong MPI compiler wrappers or the wrong mpirun/mpiexec. Typically this is caused by a PATH environment variable that is pointing to the wrong executables (mpicc, mpirun). Most Linux distributions, compilers, etc, come with their own MPI versions, and this can be very confusing.
> >
> > Try using full path names for mpicc and for mpirun. That is a bulletproof method to get exactly what you want. In your case use /usr/local/bin (as you configured with --prefix=/usr/local). (Actually, I prefer to configure with a more distinctive name for the prefix, something like /usr/local/openmpi-1.3.2, to avoid any confusion with other MPIs.)
> >
> > You can also try "which mpicc" and "which mpirun", or "mpicc --showme" and "mpirun --help" to get a bit more information about what you are really using.
> >
> > I hope this helps.
> > Gus Correa
> > -
> > Gustavo Correa
> > Lamont-Doherty Earth Observatory - Columbia University
> > Palisades, NY, 10964-8000 - USA
> > -
> >
> > Kritiraj Sajadah wrote:
> >> Dear All,
> >> I have installed and configured openmpi with BLCR on my laptop:
> >>
> >> 1) configure and install blcr
> >>
> >> ./configure --prefix=/usr/local/ --enable-debug=yes --enable-libcr-tracing=yes --enable-kernel-tracing=yes --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
> >>
> >> make
> >> make install
> >>
> >> 2) configure and install openmpi
> >>
> >> ./configure --prefix=/usr/local/ --enable-picky --enable-debug --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries --enable-trace --enable-static=yes --enable-debug --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/ --with-blcr-libdir=/usr/local/lib --enable-mpi-threads=yes
> >>
> >> make all install
> >>
> >> 3) add the environment variables.
> >>
> >> Open the $HOME/.bashrc and added the following:
> >>
> >> PATH="/usr/local/include:$PATH"
> >> LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
> >>
> >> Now the problem:
> >>
> >> I am trying to checkpoint the following MPI application:
> >>
> >> #include
> >> #include
> >>
> >> main(int argc,
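The advice above about "which mpicc" rests on how a PATH-style list is searched: the first directory that contains an executable with the requested name wins, so a stray entry earlier in the list can shadow /usr/local/bin. The following is a minimal sketch of that lookup, not code from Open MPI or from the thread; find_in_path() and is_executable_file() are hypothetical helper names.

```c
/* Sketch: resolve a command name against a PATH-style list, first match
 * wins, roughly what "which" reports. Helper names are hypothetical. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Nonzero if p names a regular file that we may execute. */
static int is_executable_file(const char *p)
{
    struct stat st;
    return stat(p, &st) == 0 && S_ISREG(st.st_mode) && access(p, X_OK) == 0;
}

/* Writes the full path of the first match into out and returns 0.
 * Returns -1 if nothing matches (out is then unspecified). */
int find_in_path(const char *name, const char *path, char *out, size_t len)
{
    if (name == NULL || path == NULL)
        return -1;
    const char *start = path;
    for (;;) {
        const char *end = strchr(start, ':');
        size_t dlen = end ? (size_t)(end - start) : strlen(start);
        /* An empty entry stands for the current directory. */
        int n = dlen ? snprintf(out, len, "%.*s/%s", (int)dlen, start, name)
                     : snprintf(out, len, "./%s", name);
        if (n > 0 && (size_t)n < len && is_executable_file(out))
            return 0;
        if (end == NULL)
            return -1;
        start = end + 1;
    }
}

#ifdef WHICH_DEMO_MAIN   /* build with -DWHICH_DEMO_MAIN to get a small demo */
int main(void)
{
    const char *names[] = { "mpicc", "mpirun" };
    char full[4096];
    for (size_t i = 0; i < 2; i++) {
        if (find_in_path(names[i], getenv("PATH"), full, sizeof full) == 0)
            printf("%s -> %s\n", names[i], full);
        else
            printf("%s not found in PATH\n", names[i]);
    }
    return 0;
}
#endif
```

If the demo prints a path outside the prefix Open MPI was configured with, the shell is picking up a different MPI installation, which is the situation described in the reply above.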
[OMPI users] *** An error occurred in MPI_Init
Dear All,

I have installed and configured openmpi with BLCR on my laptop:

1) Configure and install blcr

./configure --prefix=/usr/local/ --enable-debug=yes --enable-libcr-tracing=yes --enable-kernel-tracing=yes --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
make
make install

2) Configure and install openmpi

./configure --prefix=/usr/local/ --enable-picky --enable-debug --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries --enable-trace --enable-static=yes --enable-debug --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/ --with-blcr-libdir=/usr/local/lib --enable-mpi-threads=yes
make all install

3) Add the environment variables

I opened $HOME/.bashrc and added the following:

PATH="/usr/local/include:$PATH"
LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"

Now the problem: I am trying to checkpoint the following MPI application:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int node;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &node);
        printf("Hello World from Node %d\n", node);
        MPI_Finalize();
        return 0;
    }

I am running mpirun as follows:

raj-laptop> mpirun -am ft-enable-cr helloworld

The errors are as follows:

--
It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

opal_cr_init() failed failed
--> Returned value -1 instead of OPAL_SUCCESS
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[raj-laptop:9439] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[raj-laptop:09439] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 77
--
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

ompi_mpi_init: orte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
--

Does it have something to do with running on a single node, i.e. my laptop? Or does it have something to do with configuration or libraries?

Any help will be much appreciated.

Regards,
Raj
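
A note on the PATH setting in this message: one way to see which mpicc and mpirun the shell would actually pick up is to walk PATH by hand, in the spirit of the "which mpicc" check suggested in Gus's reply. The sketch below is only an illustration; the helper name first_on_path is made up for this example and is not part of Open MPI or BLCR.

```shell
#!/bin/sh
# first_on_path NAME [PATHLIST]
# Print the first directory in PATHLIST (default: $PATH) that holds an
# executable file called NAME. Returns 1 if no directory has one.
# (Hypothetical helper, written for this example only.)
first_on_path() {
    name=$1
    list=${2-$PATH}
    old_ifs=$IFS
    IFS=:
    for dir in $list; do
        # An empty PATH entry means the current directory.
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Report where the MPI wrapper and launcher would be found, if at all.
for tool in mpicc mpirun; do
    if where=$(first_on_path "$tool"); then
        echo "$tool -> $where/$tool"
    else
        echo "$tool not found on PATH"
    fi
done
```

With /usr/local/include at the front of PATH instead of /usr/local/bin, a search like this would skip that directory (it holds headers, not executables) and report whatever MPI installation comes next, if any.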
[OMPI users] error while loading shared libraries: libcr.so.0: cannot open shared object file: No such file or directory.
Dear All,

I have installed openmpi and blcr on my laptop and am trying to checkpoint an MPI application. Both openmpi and blcr are installed in /usr/local.

When I try to checkpoint an MPI application, I get the following error:

error while loading shared libraries: libcr.so.0: cannot open shared object file: No such file or directory.

Any help would be very much appreciated.

Regards,
Raj
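
This error comes from the dynamic loader, which could not find libcr.so.0 in its search path. A minimal sketch of one possible fix follows, assuming BLCR's library was installed under /usr/local/lib (the message only says the prefix is /usr/local, so the lib subdirectory is an assumption). The helper name prepend_dir is made up for this example.

```shell
#!/bin/sh
# prepend_dir DIR LIST
# Print the colon-separated LIST with DIR at the front, unless DIR is
# already one of its entries. (Hypothetical helper for this example.)
prepend_dir() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;        # already present
        "::")       printf '%s\n' "$dir" ;;         # LIST was empty
        *)          printf '%s\n' "$dir:$list" ;;   # prepend
    esac
}

# Assumption: libcr.so.0 lives in /usr/local/lib.
LD_LIBRARY_PATH=$(prepend_dir /usr/local/lib "${LD_LIBRARY_PATH-}")
export LD_LIBRARY_PATH
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Putting the last three lines in $HOME/.bashrc would apply the setting to new shells; the guard in prepend_dir keeps the entry from being added twice when the file is sourced more than once.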
Re: [OMPI users] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_crs_blcr: file not found (ignored)
Hi Jeff,

In fact I am testing it on my laptop before installing it on the cluster. I downloaded BLCR and installed it in /usr/local on my laptop. Then I installed openmpi using the following options:

./configure --prefix=/usr/local --with-ft=cr --enable-ft-thread --enable-mpi-threads --with-blcr=/usr/local/lib

So everything is installed and tested on my laptop for now, but I am still getting the error. Please help.

Thanks
Raj

--- On Mon, 5/4/09, Jeff Squyres wrote:

> From: Jeff Squyres
> Subject: Re: [OMPI users] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_crs_blcr: file not found (ignored)
> To: "Open MPI Users"
> Date: Monday, May 4, 2009, 2:09 PM
>
> On May 4, 2009, at 9:06 AM, Kritiraj Sajadah wrote:
>
> > raj@raj:mpirun -np 1 -am ft-enable-cr mpisleep
> >
> > I got the following with no checkpointing performed:
> > raj@raj:mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_crs_blcr: file not found (ignored)
>
> This is usually a faulty error message from libltdl. It usually means
> that the dependent libraries for a component cannot be found -- e.g.,
> is blcr installed on every node where you're trying to use it?
>
> --Jeff Squyres
> Cisco Systems
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
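
Jeff's point about dependent libraries can be checked with ldd, which marks any shared library it cannot resolve as "not found". The sketch below filters ldd-style output for those entries. The helper name missing_deps is made up for this example, and the ".so" file name for the component is an assumption (the error message quotes the path without an extension).

```shell
#!/bin/sh
# missing_deps
# Read ldd-style output on stdin and print the names of the libraries
# that the loader could not resolve. (Hypothetical helper.)
missing_deps() {
    awk '/=> not found/ { print $1 }'
}

# Assumed file name of the BLCR checkpoint component.
component=/usr/local/lib/openmpi/mca_crs_blcr.so

if [ -f "$component" ] && command -v ldd >/dev/null 2>&1; then
    ldd "$component" | missing_deps
else
    echo "skipping: $component or ldd not available"
fi
```

If this prints a library name such as libcr.so.0, the component file itself is present and the "file not found" message is really about that dependency.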
[OMPI users] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_crs_blcr: file not found (ignored)
Dear All,

Thanks to Josh and Yaakoub, I was able to configure my openmpi as follows:

raj@raj:./configure --prefix=/usr/local --with-ft=cr --enable-ft-thread --enable-mpi-threads --with-blcr=/usr/local
raj@raj:make all install

I tried to checkpoint an MPI application running on a single node using the following command:

raj@raj:mpirun -np 1 -am ft-enable-cr mpisleep

I got the following, with no checkpointing performed:

raj@raj:mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_crs_blcr: file not found (ignored)

Please help.

Regards,
Raj
[OMPI users] Checkpointing configuration problem
Dear all,

I am trying to install openmpi 1.3 on my laptop. I successfully installed BLCR in /usr/local.

When installing openmpi using the following options:

./configure --prefix=/usr/local --with-ft=cr --enable-ft-thread --enable-MPI-thread --with-blcr=/usr/local

I got the following error:

== System-specific tests ...
checking if want fault tolerance thread... Must enable progress or MPI threads to use this option
configure: error: Cannot continue

Help please.

Regards,
Raj
[OMPI users] checkpoint file contains nothing
Hi,

I have installed openmpi-1.3a1r18651 and tried to checkpoint an MPI application:

raj@portal018:~/examples> mpirun -np 1 -am ft-enable-cr ./myapp.sh &
raj@portal018:~/examples> ompi-checkpoint --term 30416

However, when I try to restart from the checkpointed file, I get the following message:

raj@portal018:~> ompi-restart -v -machinefile portal018 ompi_global_snapshot_30416.ckpt
[portal018:20178] Checking for the existence of (/home/raj/ompi_global_snapshot_30416.ckpt)
[portal018:20178] Restarting from file (ompi_global_snapshot_30416.ckpt)
[portal018:20178]Exec in self
--
mpirun could not find anything to do. It is possible that you forgot to specify how many processes to run via the "-np" argument.
--

Any help will be much appreciated.

Regards,
Raj