Re: [OMPI users] Ompi-restart failed and process migration
The ~/.openmpi/mca-params.conf file should contain the same information on all nodes.

You can install Open MPI as root. However, we do not recommend that you run Open MPI as root.

If the user's $HOME directory is NFS mounted, then you can use an NFS-mounted directory to store your files. With this option you do not need to use the local disk. For an NFS-mounted directory you only need to set:

  snapc_base_global_snapshot_dir=/path_to_NFS_directory/

If you need to stage the files, then the following options are what you need:

  snapc_base_store_in_place=0
  snapc_base_global_snapshot_dir=/path_to_global_storage_dir/
  crs_base_snapshot_dir=/path_to_local_storage_dir/

As you start getting set up, I would recommend the NFS options to reduce the number of variables that you need to worry about to get the basic setup working.

-- Josh

On Tue, Apr 24, 2012 at 11:43 AM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi, thank you for your reply.
> I have some questions:
> Q1: I tried two kinds of mca-params.conf:
> (1) crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>     snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>
>     On my master, /root/kidd_openMPI is my Open MPI install directory,
>     and it is shared over NFS.
>     Do I have to mount a user account rather than a directory?
>
> (2) snapc_base_store_in_place=0
>     crs_base_snapshot_dir=/tmp/OmpiStore/local
>     snapc_base_global_snapshot_dir=/tmp/OmpiStore/global
>
>     In this case I do not use NFS for OmpiStore/local and
>     OmpiStore/global; is that right?
>
> (3) Do I set up .openmpi on all nodes, or just on the master?
>
> (4) I installed Open MPI as root; should I move to a
>     regular user account?
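As a rough illustration of the two storage layouts described in the reply above, here is a minimal shell sketch that writes either variant into an MCA parameter file. The parameter names are the ones from the thread; the function name `write_mca_params` and its argument order are made up for this example, and the directories are placeholders.

```shell
#!/bin/sh
# Sketch only: write one of the two checkpoint-storage layouts from the
# thread into an MCA parameter file. write_mca_params is a hypothetical
# helper, not part of Open MPI.

write_mca_params() {
    mode=$1       # "nfs" or "staged"
    conf=$2       # target file, e.g. "$HOME/.openmpi/mca-params.conf"
    global=$3     # global snapshot directory
    local_dir=$4  # node-local snapshot directory (staged mode only)

    mkdir -p "$(dirname "$conf")" || return 1
    case $mode in
        nfs)
            # NFS layout: only the global snapshot directory is set.
            printf 'snapc_base_global_snapshot_dir=%s\n' "$global" > "$conf"
            ;;
        staged)
            # Staged layout: checkpoint to local disk, then copy to global storage.
            {
                printf 'snapc_base_store_in_place=0\n'
                printf 'snapc_base_global_snapshot_dir=%s\n' "$global"
                printf 'crs_base_snapshot_dir=%s\n' "$local_dir"
            } > "$conf"
            ;;
        *)
            echo "usage: write_mca_params nfs|staged FILE GLOBAL_DIR [LOCAL_DIR]" >&2
            return 2
            ;;
    esac
}
```

Running the same call on every node (or writing the file once into an NFS-mounted home directory) is one way to keep the file identical everywhere, as the reply asks.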
Re: [OMPI users] Ompi-restart failed and process migration
Hi, thank you for your reply.
I have some questions:
Q1: I tried two kinds of mca-params.conf:
(1) crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
    snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints

    On my master, /root/kidd_openMPI is my Open MPI install directory, and it is shared over NFS.
    Do I have to mount a user account rather than a directory?

(2) snapc_base_store_in_place=0
    crs_base_snapshot_dir=/tmp/OmpiStore/local
    snapc_base_global_snapshot_dir=/tmp/OmpiStore/global

    In this case I do not use NFS for OmpiStore/local and OmpiStore/global; is that right?

(3) Do I set up .openmpi on all nodes, or just on the master?

(4) I installed Open MPI as root; should I move to a regular user account?

From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Date: 2012/4/24 (Tue) 10:58 PM
Subject: Re: [OMPI users] Ompi-restart failed and process migration

On Tue, Apr 24, 2012 at 10:10 AM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi, thank you for your reply,
> but it still fails. I must add -x LD_LIBRARY_PATH.
> These are all my settings:
> (1) Master node (cuda07) and slave node (cuda08):
>   Configure:
>   ./configure --prefix=/root/kidd_openMPI --with-ft=cr
>     --enable-ft-thread --with-blcr=/usr/local/BLCR
>     --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
>     --enable-static --enable-shared --enable-opal-progress-threads; make;
>     make install;
>
> (2) PATH and LD_LIBRARY_PATH:
>   # In /etc/profile
>   ==> export PATH=$PATH:/usr/local/BLCR/bin
>   ==> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
>   # In ~/.bashrc
>   ==> export PATH=$PATH:/root/kidd_openMPI/bin
>   ==> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
>
> (3) Compiling and running:
>   ==> ~/kidd_openMPI/NBody_TEST# mpicc -o TEST -DDEFSIZE=5000 \
>         -DDEF_PROC=2 MPINbodyOMP.c
>
>   ==> root@cuda07:~/kidd_openMPI/NBody_TEST# mpirun -hostfile Hosts
>         -np 2 TEST
>
>   TEST: error while loading shared libraries: libcr.so.0: cannot open shared
>   object file: No such file or directory

I still think the core problem is with the search path given this message. Open MPI is trying to load BLCR's libcr.so.0, and it is not finding the library in the LD_LIBRARY_PATH search path. Something is still off in the backend nodes. Try adding the BLCR PATH/LD_LIBRARY_PATH to your .bashrc instead of the profile.

>
>   ==> I made sure the master and slave have the same install and the same paths.
>   I had the slave node use cr_restart to restart a context file that was
>   checkpointed by the master, so BLCR works;
>   but it still cannot open the shared object file libcr.so.0.

So BLCR is giving this error?

>
> (4) If I pass -x LD_LIBRARY_PATH (local mount)
>   (4-1) My mca-params.conf (on the master):
>   ==> snapc_base_store_in_place=0
>       crs_base_snapshot_dir=/tmp/OmpiStore/local
>       snapc_base_global_snapshot_dir=/tmp/OmpiStore/global
>
>   step 1: mpirun -hostfile Hosts -np 2 -x LD_LIBRARY_PATH -am
>           ft-enable-cr ./TEST
>   step 2: ompi-checkpoint -term Pid (I use another command)
>   step 3:
>     cd /tmp/OmpiStore/global
>     ==> ompi-restart Ompi_Pid.ckpt   (all processes restart only on the master)
>     ==> ompi-restart --hostfile Host Ompi_Pid.ckpt
>   Error message:
> root@cuda07:/tmp/OmpiStore/global#
> ompi-restart --preload -hostfile Hosts ompi_global_snapshot_8873.ckpt/
> Warning: Permanently added the RSA host key for IP address '192.168.1.10' to
> the list of known hosts.
> --
> WARNING: Remote peer ([[37567,0],1]) failed to preload a file.
> Exit Status: 256
> Local File: /tmp/OmpiStore/global/./opal_snapshot_1.ckpt
> Remote File:
> /tmp/OmpiStore/global/ompi_global_snapshot_8873.ckpt/0/opal_snapshot_1.ckpt
> Command:
> scp -r
> cuda07:/tmp/OmpiStore/global/ompi_global_snapshot_8873.ckpt/0/opal_snapshot_1.ckpt
> \
> /tmp/OmpiStore/global/./opal_snapshot_1.ckpt
>
> Will continue attempting to launch the process(es).
> --
> [cuda08:08899] Error: Unable to access the path [./opal_snapshot_1.ckpt]!
> --
> Error: The filename (opal_snaps
Re: [OMPI users] Ompi-restart failed and process migration
Hi, thank you for your reply,
but it still fails. I must add -x LD_LIBRARY_PATH.
These are all my settings:

(1) Master node (cuda07) and slave node (cuda08):
  Configure:
  ./configure --prefix=/root/kidd_openMPI --with-ft=cr --enable-ft-thread
    --with-blcr=/usr/local/BLCR --with-blcr-libdir=/usr/local/BLCR/lib
    --enable-mpirun-prefix-by-default --enable-static --enable-shared
    --enable-opal-progress-threads; make; make install;

(2) PATH and LD_LIBRARY_PATH:
  # In /etc/profile
  ==> export PATH=$PATH:/usr/local/BLCR/bin
  ==> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
  # In ~/.bashrc
  ==> export PATH=$PATH:/root/kidd_openMPI/bin
  ==> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib

(3) Compiling and running:
  ==> ~/kidd_openMPI/NBody_TEST# mpicc -o TEST -DDEFSIZE=5000 \
        -DDEF_PROC=2 MPINbodyOMP.c

  ==> root@cuda07:~/kidd_openMPI/NBody_TEST# mpirun -hostfile Hosts -np 2 TEST

  TEST: error while loading shared libraries: libcr.so.0: cannot open shared
  object file: No such file or directory

  ==> I made sure the master and slave have the same install and the same paths.
  I had the slave node use cr_restart to restart a context file that was
  checkpointed by the master, so BLCR works;
  but it still cannot open the shared object file libcr.so.0.

(4) If I pass -x LD_LIBRARY_PATH (local mount)
  (4-1) My mca-params.conf (on the master):
  ==> snapc_base_store_in_place=0
      crs_base_snapshot_dir=/tmp/OmpiStore/local
      snapc_base_global_snapshot_dir=/tmp/OmpiStore/global

  step 1: mpirun -hostfile Hosts -np 2 -x LD_LIBRARY_PATH -am ft-enable-cr ./TEST
  step 2: ompi-checkpoint -term Pid (I use another command)
  step 3:
    cd /tmp/OmpiStore/global
    ==> ompi-restart Ompi_Pid.ckpt   (all processes restart only on the master)
    ==> ompi-restart --hostfile Host Ompi_Pid.ckpt
  Error message:
root@cuda07:/tmp/OmpiStore/global#
ompi-restart --preload -hostfile Hosts ompi_global_snapshot_8873.ckpt/
Warning: Permanently added the RSA host key for IP address '192.168.1.10' to
the list of known hosts.
--
WARNING: Remote peer ([[37567,0],1]) failed to preload a file.
Exit Status: 256
Local File: /tmp/OmpiStore/global/./opal_snapshot_1.ckpt
Remote File: /tmp/OmpiStore/global/ompi_global_snapshot_8873.ckpt/0/opal_snapshot_1.ckpt
Command:
scp -r cuda07:/tmp/OmpiStore/global/ompi_global_snapshot_8873.ckpt/0/opal_snapshot_1.ckpt \
  /tmp/OmpiStore/global/./opal_snapshot_1.ckpt

Will continue attempting to launch the process(es).
--
[cuda08:08899] Error: Unable to access the path [./opal_snapshot_1.ckpt]!
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have
not provided a filename or provided an invalid filename.
Please see --help for usage.
--
I am 0 loop=40 in #pragma time1=446.558860
^Cmpirun: killing job...

(5)
> A couple solutions:
> - have the PATH and LD_LIBRARY_PATH set the same on all nodes
> - have ompi-restart pass the -x parameter to the underlying mpirun by
>   using the -mpirun_opts command line switch:
>   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ..

How do I use --mpirun_opts? This is my command:
  ==> ompi-restart --mpirun_opts -x LD_LIBRARY_PATH -hostfile Hosts \
        ompi_global_snapshot_8873.ckpt/
but it gives an error.

Thanks.

From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Date: 2012/4/24 (Tue) 3:23 AM
Subject: Re: [OMPI users] Ompi-restart failed and process migration

On Mon, Apr 23, 2012 at 2:45 PM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi, thank you for your reply.
>
> I have some problems:
> (1)
> Now, on my platform, all nodes have the same PATH and LD_LIBRARY_PATH.
> I set them in .bashrc:
> #BLCR
> export PATH=$PATH:/usr/local/BLCR/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
> #openMPI
> export PATH=$PATH:/root/kidd_openMPI/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
>
> but ,when I runnin
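On the --mpirun_opts question in the message above: the command shown there leaves out the double quotes that appear in the suggested form. The sketch below only demonstrates the shell word-splitting difference; `count_args` is a throwaway helper invented for this example, not an Open MPI tool, and nothing here has been run against ompi-restart itself.

```shell
#!/bin/sh
# Sketch: why the quotes around "-x LD_LIBRARY_PATH" matter to the shell.
# count_args simply reports how many separate arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell passes three separate words, so a program would see
# "-x" as the value of --mpirun_opts and LD_LIBRARY_PATH as a stray argument.
count_args --mpirun_opts -x LD_LIBRARY_PATH      # prints 3

# Quoted: "-x LD_LIBRARY_PATH" arrives as one argument, the option's value.
count_args --mpirun_opts "-x LD_LIBRARY_PATH"    # prints 2
```

Applied to the command in the message, the quoted form would presumably read:

  ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" -hostfile Hosts ompi_global_snapshot_8873.ckpt/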
Re: [OMPI users] Ompi-restart failed and process migration
On Mon, Apr 23, 2012 at 2:45 PM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi, thank you for your reply.
>
> I have some problems:
> (1)
> Now, on my platform, all nodes have the same PATH and LD_LIBRARY_PATH.
> I set them in .bashrc:
> #BLCR
> export PATH=$PATH:/usr/local/BLCR/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
> #openMPI
> export PATH=$PATH:/root/kidd_openMPI/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
>
> But when I run mpirun, I have to add "-x LD_LIBRARY_PATH", or
> it can't run.
> Example: mpirun -hostfile hosts -np 2 ./TEST
> Error message ==>
> ./TEST: error while loading shared libraries: libcr.so.0: cannot open shared
> object file: No such file or directory

It sounds like something is still not quite right with your environment and system setup. If you have set the PATH and LD_LIBRARY_PATH appropriately on all nodes then you should not have to pass the "-x LD_LIBRARY_PATH" option to mpirun. Additionally, the error you are seeing is from BLCR. That error seems to indicate that BLCR is not installed correctly on all nodes.

Some things to look into (in this order):
 1) Make sure that you have BLCR and Open MPI installed in the same location on all machines.
 2) Make sure that BLCR works on all machines by checkpointing and restarting a single-process program.
 3) Make sure that Open MPI works on all machines -without- checkpointing, and without passing the -x option.
 4) Checkpoint/restart an MPI job.

> (2) BLCR need to unify linux-kernel of all the Node?
> Now, I reset all Node (using Ubuntu 10.04).

I do not understand what you are trying to ask here. Please rephrase.

> (3)
> Now, my program uses DLLs. I implemented some DLLs, and the MPI program
> calls the DLLs.
> Ompi can check/Restart Program contains DLL?

I do not understand what you are trying to ask here. Please rephrase.

-- Josh
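The checklist above comes down to confirming that libcr.so.0 is reachable through the search path that a remotely launched process actually sees. A minimal sketch of such a check follows; `find_in_path_list` is a helper invented for this example, and the node name in the comments is taken from later messages in the thread. One thing that may be worth ruling out (an assumption, not something established in the thread): some distributions ship a default ~/.bashrc that returns early for non-interactive shells, so exports placed below that point would not reach processes started over ssh.

```shell
#!/bin/sh
# find_in_path_list LIB PATHLIST
# Prints the first directory in the colon-separated PATHLIST that contains
# LIB and returns 0; returns 1 if no directory does.
# Hypothetical helper for illustration only.
find_in_path_list() {
    lib=$1
    old_ifs=$IFS
    IFS=:
    for dir in $2; do
        if [ -n "$dir" ] && [ -e "$dir/$lib" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Possible usage on a node, in the same kind of shell mpirun would get:
#   find_in_path_list libcr.so.0 "$LD_LIBRARY_PATH" || echo "libcr.so.0 not on LD_LIBRARY_PATH"
# To see what a non-interactive remote shell has set, something like:
#   ssh cuda08 'echo "$LD_LIBRARY_PATH"'
```

If the remote shell prints an empty or shorter LD_LIBRARY_PATH than an interactive login does, that would be consistent with the "cannot open shared object file" error going away only when -x LD_LIBRARY_PATH is passed.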
Re: [OMPI users] Ompi-restart failed and process migration
Hi, thank you for your reply.

I have some problems:
(1)
Now, on my platform, all nodes have the same PATH and LD_LIBRARY_PATH.
I set them in .bashrc:
#BLCR
export PATH=$PATH:/usr/local/BLCR/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
#openMPI
export PATH=$PATH:/root/kidd_openMPI/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib

But when I run mpirun, I have to add "-x LD_LIBRARY_PATH", or it can't run.
Example: mpirun -hostfile hosts -np 2 ./TEST
Error message ==>
./TEST: error while loading shared libraries: libcr.so.0: cannot open shared
object file: No such file or directory

(2) BLCR need to unify linux-kernel of all the Node?
Now, I reset all Node (using Ubuntu 10.04).

(3)
Now, my program uses DLLs. I implemented some DLLs, and the MPI program calls the DLLs.
Ompi can check/Restart Program contains DLL?

From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Date: 2012/4/23 (Mon) 10:51 PM
Subject: Re: [OMPI users] Ompi-restart failed and process migration

I wonder if the LD_LIBRARY_PATH is not being set properly upon restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'. ompi-restart will not pass that variable along for you, so if you are using that to set the BLCR path this might be your problem.

A couple solutions:
 - have the PATH and LD_LIBRARY_PATH set the same on all nodes
 - have ompi-restart pass the -x parameter to the underlying mpirun by using the -mpirun_opts command line switch:
   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...

Yes. ompi-restart will let you checkpoint a process on one node and restart it on another. You will have to restart the whole application since the ompi-migration operation is not available in the 1.5 series.

-- Josh
Re: [OMPI users] Ompi-restart failed and process migration
I wonder if the LD_LIBRARY_PATH is not being set properly upon restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'. ompi-restart will not pass that variable along for you, so if you are using that to set the BLCR path this might be your problem.

A couple solutions:
 - have the PATH and LD_LIBRARY_PATH set the same on all nodes
 - have ompi-restart pass the -x parameter to the underlying mpirun by using the -mpirun_opts command line switch:
   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...

Yes. ompi-restart will let you checkpoint a process on one node and restart it on another. You will have to restart the whole application since the ompi-migration operation is not available in the 1.5 series.

-- Josh

On Sat, Apr 21, 2012 at 4:11 AM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi all,
> I have some problems. I want to checkpoint/restart multiple processes on 2 nodes.
>
> My environment:
> BLCR = 0.8.4, openMPI = 1.5.5, OS = ubuntu 11.04
> I have 2 nodes:
> N05 (master, it has the NFS shared file system), N07 (slave,
> mounts the master node).
>
> My configure line: ./configure --prefix=/root/kidd_openMPI
> --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/BLCR
> --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
> --enable-static --enable-shared --enable-opal-multi-threads;
>
> I also set ~/.openmpi/mca-params.conf ->
> crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
> snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>
> The directory kidd_openMPI is my NFS shared directory.
>
> My commands:
> 1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
>
> 2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH
> -np 2 ./TEST
>
> I can restart process 0 on the master, but process 1 on N07 fails.
>
> I checked my nodes; prelink is not installed,
> so the restart failure must have some other cause.
>
> Error message -->
> --
> root@cuda05:~/kidd_openMPI/checkpoints#
> ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
> --
> Error: BLCR was not able to restart the process because exec failed.
> Check the installation of BLCR on all of the machines in your
> system. The following information may be of help:
> Return Code : -1
> BLCR Restart Command : cr_restart
> Restart Command Line : cr_restart
> /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
> opal_snapshot_1.ckpt/ompi_blcr_context.2704
> --
> --
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_1.ckpt). Returned -1.
> Check the installation of the blcr checkpoint/restart service
> on all of the machines in your system.
> ###
> Problem 2: I want to let MPI processes migrate to another node.
> If ompi-restart across multiple nodes can be made to work,
> can I restart on a new node rather than the original node?
> Example:
> checkpoint on (node1, node2, node3), then restart on (node1, node3, node4),
> or just restart on (node1, node3 (2 processes)).
>
> Please help me, thanks.
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
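The reply above says a checkpoint can be restarted on different nodes as long as the whole application is restarted. A minimal sketch of preparing a hostfile for the (node1, node3, node4) example from the question follows. `write_hostfile` and the file name Hosts.restart are made up for illustration; the "slots=N" form is ordinary Open MPI hostfile syntax, but whether ompi-restart in 1.5.5 places processes exactly as the hostfile suggests is an assumption that has not been verified here.

```shell
#!/bin/sh
# write_hostfile FILE HOST[:SLOTS]...
# Writes one hostfile line per host; "host:N" becomes "host slots=N".
# Hypothetical helper for illustration only.
write_hostfile() {
    out=$1
    shift
    : > "$out" || return 1
    for entry in "$@"; do
        host=${entry%%:*}
        case $entry in
            *:*) printf '%s slots=%s\n' "$host" "${entry#*:}" >> "$out" ;;
            *)   printf '%s\n' "$host" >> "$out" ;;
        esac
    done
}

# Possible usage for the two cases in the question:
#   write_hostfile Hosts.restart node1 node3 node4
#   write_hostfile Hosts.restart node1 node3:2
# followed by a restart of the whole job with the new hostfile, e.g.
#   ompi-restart -hostfile Hosts.restart ompi_global_snapshot_2892.ckpt/
```

The new nodes would still need the same BLCR and Open MPI installation and library paths as the original ones, which is the point the rest of the thread keeps returning to.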
[OMPI users] Ompi-restart failed and process migration
Hi all,
I have some problems. I want to checkpoint/restart multiple processes on 2 nodes.

My environment:
BLCR = 0.8.4, openMPI = 1.5.5, OS = ubuntu 11.04
I have 2 nodes:
N05 (master, it has the NFS shared file system), N07 (slave, mounts the master node).

My configure line: ./configure --prefix=/root/kidd_openMPI --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/BLCR --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default --enable-static --enable-shared --enable-opal-multi-threads;

I also set ~/.openmpi/mca-params.conf ->
crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints

The directory kidd_openMPI is my NFS shared directory.

My commands:
1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c

2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH -np 2 ./TEST

I can restart process 0 on the master, but process 1 on N07 fails.

I checked my nodes; prelink is not installed, so the restart failure must have some other cause.

Error message -->
--
root@cuda05:~/kidd_openMPI/checkpoints#
ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
--
Error: BLCR was not able to restart the process because exec failed.
Check the installation of BLCR on all of the machines in your
system. The following information may be of help:
Return Code : -1
BLCR Restart Command : cr_restart
Restart Command Line : cr_restart
/root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
opal_snapshot_1.ckpt/ompi_blcr_context.2704
--
--
Error: Unable to obtain the proper restart command to restart from the
checkpoint file (opal_snapshot_1.ckpt). Returned -1.
Check the installation of the blcr checkpoint/restart service
on all of the machines in your system.
###
Problem 2: I want to let MPI processes migrate to another node.
If ompi-restart across multiple nodes can be made to work, can I restart on a new node rather than the original node?
Example:
checkpoint on (node1, node2, node3), then restart on (node1, node3, node4),
or just restart on (node1, node3 (2 processes)).

Please help me, thanks.