Re: [OMPI users] Ompi-restart failed and process migration

2012-04-24 Thread Josh Hursey
The ~/.openmpi/mca-params.conf file should contain the same
information on all nodes.

You can install Open MPI as root. However, we do not recommend that
you run Open MPI as root.

If the user $HOME directory is NFS mounted, then you can use an NFS
mounted directory to store your files. With this option you do not
need to use the local disk. For an NFS mounted directory you only need
to set:
  snapc_base_global_snapshot_dir=/path_to_NFS_directory/

If you need to stage the files then the following options are what you need.
  snapc_base_store_in_place=0
  snapc_base_global_snapshot_dir=/path_to_global_storage_dir/
  crs_base_snapshot_dir=/path_to_local_storage_dir/

As you start getting set up, I would recommend the NFS options to
reduce the number of variables that you need to worry about to get the
basic setup working.
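
The two configurations described above could be written out as in the sketch below. The directory paths are the placeholders from this message, and CONF_DIR is a hypothetical scratch location chosen so that a real ~/.openmpi/mca-params.conf is not overwritten. Whichever variant is used would then need to be copied to ~/.openmpi/mca-params.conf on every node.

```shell
#!/bin/sh
# Sketch only: writes the two mca-params.conf variants into a scratch directory.
# CONF_DIR is a hypothetical variable for this example; the paths are placeholders.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

# Variant 1: checkpoints are stored directly in an NFS-mounted directory.
cat > "$CONF_DIR/mca-params.nfs.conf" <<'EOF'
snapc_base_global_snapshot_dir=/path_to_NFS_directory/
EOF

# Variant 2: checkpoints are written to local disk, then staged to global storage.
cat > "$CONF_DIR/mca-params.staged.conf" <<'EOF'
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/path_to_global_storage_dir/
crs_base_snapshot_dir=/path_to_local_storage_dir/
EOF

echo "Wrote example files to $CONF_DIR"
```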

-- Josh


On Tue, Apr 24, 2012 at 11:43 AM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi ,Thank you For your reply.
> I have some problem:
> Q1:  I setting 2 kinds  mac.para.conf
> (1) crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>   snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>
>  My Master : /root/kidd_openMPI   is My opempi-Installed Dir
> ,it is  Shared by NFS .
>  Do I have to mount  a   User_Account , Rather than a  dir  ?
>
>
>  (2) snapc_base_store_in_place=0
>   crs_base_snapshot_dir= /tmp/OmpiStore/local
>   snapc_base_global_snapshot_dir= /tmp/OmpiStore/global
>
> In this  case  ,I not use  NFS  in OmpiStore/local  &
> OmpiStore/local;
> is it right ?
>   (3)
>Do I setting .openmpi in all-Node ,or just seting on Master .
>
>   (4)  I install openmpi  in root ,should I move   to
> General-user-account ?
>
> 
> From: Josh Hursey <jjhur...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Sent: 2012/4/24 (Tue) 10:58 PM
>
> Subject: Re: [OMPI users] Ompi-restart failed and process migration
>
> On Tue, Apr 24, 2012 at 10:10 AM, kidd <q19860...@yahoo.com.tw> wrote:
>> Hi ,Thank you For your reply.
>>  but I still failed. I must add -x  LD_LIBRARY_PATH
>> this is my  All Setting ;
>> 1) Master-Node(cuda07)  &  Slaves Node(cuda08) :
>>Configure:
>>./configure --prefix=/root/kidd_openMPI  --with-ft=cr
>> --enable-ft-thread  --with-blcr=/usr/local/BLCR
>>--with-blcr-libdir=/usr/local/BLCR/lib
>> --enable-mpirun-prefix-by-default
>>--enable-static --enable-shared  --enable-opal-progress-threads; make ;
>> make install;
>>
>>   (2) Path && LD_PATH:
>> #In /etc/profile
>>  ==>export PATH=$PATH:/usr/local/BLCR/bin ;
>>  ==>export  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
>>#In ~/.bashrc
>> ==>export PATH=$PATH:/root/kidd_openMPI/bin
>> ==>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
>>
>>(3) Compiler && Running:
>>   ==> ~/kidd_openMPI/NBody_TEST#  mpicc -o  TEST -DDEFSIZE=5000  \
>>   -DDEF_PROC=2 MPINbodyOMP.c
>>
>>   ==>   root@cuda07:~/kidd_openMPI/NBody_TEST# mpirun -hostfile Hosts
>> -np 2 TEST
>>
>>   TEST: error while loading shared libraries: libcr.so.0: cannot open
>> shared
>> object file: No such file or directory
>
>
> I still think the core problem is with the search path given this
> message. Open MPI is trying to load BLCR's libcr.so.0, and it is not
> finding the library in the LD_LIBRARY_PATH search path. Something is
> still off in the backend nodes. Try adding the BLCR
> PATH/LD_LIBRARY_PATH to your .bashrc instead of the profile.
>
>
>>
>>==> I make sure  Master and Slave  have  same Install and  same Path .
>>I  let slave-node  using cr_restart   restart a contextfile
>> ,the
>> contextfile checked by Master ,so
>>Blcr  can work;
>>but it still  cannot open shared object file->libcr.so.0:
>
>
> So BLCR is giving this error?
>
>>
>>   (4)  ifI pass  -x LD_LIBRARY_PATH
>>  ( local mount )
>> (4-1)My mca-params.conf(In Master )
>>  ==> snapc_base_store_in_place=0
>>  crs_base_snapshot_dir=/tmp/OmpiStore/local
>>  snapc_base_global_snapshot_dir=/tmp/OmpiStore/global
>>
>>   step 1: mpirun -hostfile Hosts -np 2 -x LD_LIBRARY_PATH -am
>> ft-enable-cr ./TEST
>>   step 2: ompi-checkpoint -term Pid ( I use another command

Re: [OMPI users] Ompi-restart failed and process migration

2012-04-24 Thread kidd
Hi, thank you for your reply.
I have some questions.
Q1: I have tried two kinds of mca-params.conf:

  (1) crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
      snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints

      On my master node, /root/kidd_openMPI is my Open MPI install
      directory, and it is shared over NFS.
      Do I have to mount a user account rather than a directory?

  (2) snapc_base_store_in_place=0
      crs_base_snapshot_dir= /tmp/OmpiStore/local
      snapc_base_global_snapshot_dir= /tmp/OmpiStore/global

      In this case I do not use NFS for OmpiStore/local and
      OmpiStore/global. Is that right?

  (3) Do I need to set up ~/.openmpi on all nodes, or only on the master?

  (4) I installed Open MPI as root. Should I move it to a regular user
      account?

    



From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Sent: 2012/4/24 (Tue) 10:58 PM
Subject: Re: [OMPI users] Ompi-restart failed and process migration
 
On Tue, Apr 24, 2012 at 10:10 AM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi ,Thank you For your reply.
>  but I still failed. I must add -x  LD_LIBRARY_PATH
> this is my  All Setting ;
> 1) Master-Node(cuda07)  &  Slaves Node(cuda08) :
>    Configure:
>    ./configure --prefix=/root/kidd_openMPI  --with-ft=cr
> --enable-ft-thread  --with-blcr=/usr/local/BLCR
>    --with-blcr-libdir=/usr/local/BLCR/lib  --enable-mpirun-prefix-by-default
>    --enable-static --enable-shared  --enable-opal-progress-threads; make ;
> make install;
>
>   (2) Path && LD_PATH:
>     #In /etc/profile
>  ==>export PATH=$PATH:/usr/local/BLCR/bin ;
>  ==>export  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
>    #In ~/.bashrc
>     ==>export PATH=$PATH:/root/kidd_openMPI/bin
>     ==>export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
>
>    (3) Compiler && Running:
>   ==> ~/kidd_openMPI/NBody_TEST#  mpicc -o  TEST -DDEFSIZE=5000  \
>               -DDEF_PROC=2 MPINbodyOMP.c
>
>   ==>   root@cuda07:~/kidd_openMPI/NBody_TEST# mpirun -hostfile Hosts
> -np 2 TEST
>
>   TEST: error while loading shared libraries: libcr.so.0: cannot open shared
> object file: No such file or directory


I still think the core problem is with the search path given this
message. Open MPI is trying to load BLCR's libcr.so.0, and it is not
finding the library in the LD_LIBRARY_PATH search path. Something is
still off in the backend nodes. Try adding the BLCR
PATH/LD_LIBRARY_PATH to your .bashrc instead of the profile.
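
One way to check what a given shell actually sees is sketched below. The helper name path_contains is made up for this example, and /usr/local/BLCR/lib is the BLCR library directory reported in this thread.

```shell
#!/bin/sh
# path_contains LIST DIR: succeed if DIR is an entry of the colon-separated LIST.
# (Hypothetical helper for this example.)
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

BLCR_LIB=/usr/local/BLCR/lib   # BLCR library directory used in this thread

if path_contains "${LD_LIBRARY_PATH:-}" "$BLCR_LIB"; then
    echo "LD_LIBRARY_PATH contains $BLCR_LIB"
else
    echo "LD_LIBRARY_PATH is missing $BLCR_LIB"
fi
```

To compare against a backend node, something like `ssh cuda08 'echo $LD_LIBRARY_PATH'` prints the value a non-interactive remote shell receives. Some default .bashrc files (Ubuntu's, for instance) return early for non-interactive shells, so exports placed after that check never reach commands started over ssh. That is one possible explanation here, not a confirmed diagnosis.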


>
>    ==> I make sure  Master and Slave  have  same Install and  same Path .
>    I  let slave-node  using cr_restart   restart a contextfile ,the
> contextfile checked by Master ,so
>            Blcr  can work;
>    but it still  cannot open shared object file->libcr.so.0:


So BLCR is giving this error?

>
>   (4)  if    I pass  -x LD_LIBRARY_PATH
>  ( local mount )
>         (4-1)My mca-params.conf(In Master )
>  ==> snapc_base_store_in_place=0
>  crs_base_snapshot_dir=/tmp/OmpiStore/local
>  snapc_base_global_snapshot_dir=/tmp/OmpiStore/global
>
>       step 1: mpirun -hostfile Hosts -np 2 -x LD_LIBRARY_PATH -am
> ft-enable-cr ./TEST
>   step 2: ompi-checkpoint -term Pid ( I use another command)
>       step 3:
>    cd  /tmp/OmpiStore/global
>   ==> ompi-restart    Ompi_Pid.ckpt .   (all process
> Only Restart on Master)
>   ==> ompi-restart    --hostfile Host  Ompi_Pid.ckpt .
>  Error-Message:
> root@cuda07:/tmp/OmpiStore/global#
>   ompi-restart --preload -hostfile Hosts ompi_global_snapshot_8873.ckpt/
> Warning: Permanently added the RSA host key for IP address '192.168.1.10' to
> the list of known hosts.
> --
> WARNING: Remote peer ([[37567,0],1]) failed to preload a file.
> Exit Status: 256
> Local  File: /tmp/OmpiStore/global/./opal_snapshot_1.ckpt
> Remote File:
> /tmp/OmpiStore/global/ompi_global_snapshot_8873.ckpt/0/opal_snapshot_1.ckpt
> Command:
>   scp  -r
> cuda07:/tmp/OmpiStore/global/ompi_global_snapshot_8873.ckpt/0/opal_snapshot_1.ckpt
> \
>    /tmp/OmpiStore/global/./opal_snapshot_1.ckpt
>
> Will continue attempting to launch the process(es).
> --
> [cuda08:08899] Error: Unable to access the path [./opal_snapshot_1.ckpt]!
> --
> Error: The filename (opal_snaps

Re: [OMPI users] Ompi-restart failed and process migration

2012-04-24 Thread kidd
Hi, thank you for your reply.
It still fails unless I add -x LD_LIBRARY_PATH.
Here are all of my settings:
(1) Master node (cuda07) and slave node (cuda08):
   Configure:
   ./configure --prefix=/root/kidd_openMPI --with-ft=cr --enable-ft-thread \
       --with-blcr=/usr/local/BLCR --with-blcr-libdir=/usr/local/BLCR/lib \
       --enable-mpirun-prefix-by-default --enable-static --enable-shared \
       --enable-opal-progress-threads; make; make install;

  (2) PATH and LD_LIBRARY_PATH:
    # In /etc/profile
    ==> export PATH=$PATH:/usr/local/BLCR/bin
    ==> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
    # In ~/.bashrc
    ==> export PATH=$PATH:/root/kidd_openMPI/bin
    ==> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib

  (3) Compiling and running:
    ==> ~/kidd_openMPI/NBody_TEST# mpicc -o TEST -DDEFSIZE=5000 \
            -DDEF_PROC=2 MPINbodyOMP.c

    ==> root@cuda07:~/kidd_openMPI/NBody_TEST# mpirun -hostfile Hosts -np 2 TEST
    TEST: error while loading shared libraries: libcr.so.0: cannot open shared object file: No such file or directory

    ==> I made sure that the master and the slave have the same installation
    and the same paths.
    On the slave node I used cr_restart to restart a context file that was
    checkpointed on the master, so BLCR itself works.
    But TEST still cannot open the shared object file libcr.so.0.

  (4) If I pass -x LD_LIBRARY_PATH (local mount):
    (4-1) My mca-params.conf (on the master):
    ==> snapc_base_store_in_place=0
        crs_base_snapshot_dir=/tmp/OmpiStore/local
        snapc_base_global_snapshot_dir=/tmp/OmpiStore/global

    step 1: mpirun -hostfile Hosts -np 2 -x LD_LIBRARY_PATH -am ft-enable-cr ./TEST
    step 2: ompi-checkpoint -term Pid   (I use another command)
    step 3:
      cd /tmp/OmpiStore/global
      ==> ompi-restart Ompi_Pid.ckpt   (all processes restart only on the master)
      ==> ompi-restart --hostfile Host Ompi_Pid.ckpt
    Error message:
root@cuda07:/tmp/OmpiStore/global#
  ompi-restart --preload -hostfile Hosts ompi_global_snapshot_8873.ckpt/
Warning: Permanently added the RSA host key for IP address '192.168.1.10' to the list of known hosts.
--
WARNING: Remote peer ([[37567,0],1]) failed to preload a file.
Exit Status: 256
Local  File: /tmp/OmpiStore/global/./opal_snapshot_1.ckpt
Remote File: /tmp/OmpiStore/global/ompi_global_snapshot_8873.ckpt/0/opal_snapshot_1.ckpt
Command:
  scp  -r cuda07:/tmp/OmpiStore/global/ompi_global_snapshot_8873.ckpt/0/opal_snapshot_1.ckpt \
   /tmp/OmpiStore/global/./opal_snapshot_1.ckpt

Will continue attempting to launch the process(es).
--
[cuda08:08899] Error: Unable to access the path [./opal_snapshot_1.ckpt]!
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have not provided a filename
   or provided an invalid filename.
   Please see --help for usage.
--
I am 0 loop=40  in #pragma  time1=446.558860 
^Cmpirun: killing job...
/*---*/
 (5)A couple solutions:
> - have the PATH and LD_LIBRARY_PATH set the same on all nodes
> - have ompi-restart pass the -x parameter to the underlying mpirun by
> using the -mpirun_opts command line switch:
>   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ..
   
 How do I use --mpirun_opts?
 This is my command ==>
 ompi-restart --mpirun_opts  -x  LD_LIBRARY_PATH  -hostfile Hosts \
 ompi_global_snapshot_8873.ckpt/
 but it gives an error.

 Thanks.



From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Sent: 2012/4/24 (Tue) 3:23 AM
Subject: Re: [OMPI users] Ompi-restart failed and process migration
 
On Mon, Apr 23, 2012 at 2:45 PM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi ,Thank you For your reply.
>
> I have some problems:
> (1)
> Now ,In the my platform , all nodes have the same path and LD_LIBRARY_PATH.
>  I set in .bashrc
> //
> #BLCR
> export PATH=$PATH:/usr/local/BLCR/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
> #openMPI
> export PATH=$PATH:/root/kidd_openMPI/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
> /---/
> but ,when I  runnin

Re: [OMPI users] Ompi-restart failed and process migration

2012-04-23 Thread Josh Hursey
On Mon, Apr 23, 2012 at 2:45 PM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi ,Thank you For your reply.
>
> I have some problems:
> (1)
> Now ,In the my platform , all nodes have the same path and LD_LIBRARY_PATH.
>  I set in .bashrc
> //
> #BLCR
> export PATH=$PATH:/usr/local/BLCR/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
> #openMPI
> export PATH=$PATH:/root/kidd_openMPI/bin
> export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib
> /---/
> but ,when I  running  mpirun  , I have to add  " -x  LD_LIBRARY_PATH" ,or
> it can't  run
>  example:  mpirun -hostfile hosts  -np  2  ./TEST .
>  Error Message==>
> ./TEST: error while loading shared libraries: libcr.so.0: cannot open shared
> object file: No such file or directory

It sounds like something is still not quite right with your
environment and system setup. If you have set the PATH and
LD_LIBRARY_PATH appropriately on all nodes then you should not have to
pass the "-x LD_LIBRARY_PATH" option to mpirun. Additionally, the
error you are seeing is from BLCR. That error seems to indicate that
BLCR is not installed correctly on all nodes.

Some things to look into (in this order):
 1) Make sure that you have BLCR and Open MPI installed in the same
location on all machines.
 2) Make sure that BLCR works on all machines by checkpointing and
restarting a single process program
 3) Make sure that Open MPI works on all machines -without-
checkpointing, and without passing the -x option.
 4) Checkpoint/restart an MPI job
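
As a first pass over the checklist above, the sketch below only tests whether the relevant command-line tools are found on PATH. The function name have_tools is made up for this example. It does not prove that checkpointing works, so step 2 still needs an actual trial with cr_run, cr_checkpoint and cr_restart on a single-process program.

```shell
#!/bin/sh
# have_tools TOOL...: report each tool not found on PATH; fail if any is missing.
# (Hypothetical helper for this example.)
have_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Steps 1-2: BLCR command-line tools.
have_tools cr_run cr_checkpoint cr_restart \
    || echo "BLCR tools are not all on PATH here"

# Steps 3-4: Open MPI launcher and checkpoint/restart tools.
have_tools mpirun ompi-checkpoint ompi-restart \
    || echo "Open MPI tools are not all on PATH here"
```

Saved as, say, check_tools.sh (a hypothetical file name), it can be run on a backend node with `ssh cuda08 'sh -s' < check_tools.sh`, which uses a non-interactive remote shell much as a remote launch would.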


>  (2)  BLCR need to unify linux-kernel  of all the Node ?
>    Now ,I reset all  Node.(using Ubuntu 10.04)

I do not understand what you are trying to ask here. Please rephrase.


>  (3)
>       Now , My porgram using  DLL . I implements some DLL  ,MPI-Program
> calls DLLs .
>   Ompi can check/Restart  Program contains  DLL ?

I do not understand what you are trying to ask here. Please rephrase.

-- Josh


> 
>
> 
> From: Josh Hursey <jjhur...@open-mpi.org>
> To: Open MPI Users <us...@open-mpi.org>
> Sent: 2012/4/23 (Mon) 10:51 PM
> Subject: Re: [OMPI users] Ompi-restart failed and process migration
>
> I wonder if the LD_LIBRARY_PATH is not being set properly upon
> restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'.
> ompi-restart will not pass that variable along for you, so if you are
> using that to set the BLCR path this might be your problem.
>
> A couple solutions:
> - have the PATH and LD_LIBRARY_PATH set the same on all nodes
> - have ompi-restart pass the -x parameter to the underlying mpirun by
> using the -mpirun_opts command line switch:
>   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...
>
> Yes. ompi-restart will let you checkpoint a process on one node and
> restart it on another. You will have to restart the whole application
> since the ompi-migration operation is not available in the 1.5 series.
>
> -- Josh
>
> On Sat, Apr 21, 2012 at 4:11 AM, kidd <q19860...@yahoo.com.tw> wrote:
>> Hi all,
>> I have Some problems,I wana check/Restart Multiple process on 2 node.
>>
>>  My environment:
>>  BLCR= 0.8.4   , openMPI= 1.5.5  , OS = ubuntu 11.04
>> I have 2 Node :
>>  N05(Master ,it have NFS shared file system),N07(slave
>>  ,mount Master-Node).
>>
>>  My configure format=./configure --prefix=/root/kidd_openMPI
>>  --with-ft=cr --enable-ft-thread  --with-blcr=/usr/local/BLCR
>>  --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
>>  --enable-static --enable-shared --enable-opal-multi-threads;
>>
>>  I had also set  ~/.openmpi/mca-params.conf->
>>     crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>>     snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints.
>>
>> the dir->kidd_openMPI is my nfs shared dir.
>>
>>  My Command :
>>  1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
>>
>>   2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH
>>      -np 2 ./TEST .
>>
>>  I can restart process-0 on Master,but process-1 on N07 was failed.
>>
>>  I checked my Node,it does not install the prelink,
>>  so the error(restart-failed) is caused by other reasons.
>>
>>  Error Message-->
>>
>> --
>>   root@cuda05:~/kidd_openMPI/checkpoints#
>>  ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
>>
>&

Re: [OMPI users] Ompi-restart failed and process migration

2012-04-23 Thread kidd
Hi, thank you for your reply.

I have some questions:
(1)
Now, on my platform, all nodes have the same PATH and LD_LIBRARY_PATH.
 I set them in .bashrc:
//
#BLCR
export PATH=$PATH:/usr/local/BLCR/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/BLCR/lib
#openMPI
export PATH=$PATH:/root/kidd_openMPI/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/root/kidd_openMPI/lib

/---/
But when I run mpirun, I have to add "-x LD_LIBRARY_PATH" or it cannot run.
 Example:  mpirun -hostfile hosts  -np  2  ./TEST
 Error message ==>
./TEST: error while loading shared libraries: libcr.so.0: cannot open shared object file: No such file or directory
 (2) Does BLCR require the same Linux kernel on all nodes?
   I have now reset all nodes (using Ubuntu 10.04).

 (3)
   My program now uses DLLs. I implemented some DLLs, and the MPI program
   calls them.
   Can Open MPI checkpoint/restart a program that uses DLLs?





From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Sent: 2012/4/23 (Mon) 10:51 PM
Subject: Re: [OMPI users] Ompi-restart failed and process migration
 
I wonder if the LD_LIBRARY_PATH is not being set properly upon
restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'.
ompi-restart will not pass that variable along for you, so if you are
using that to set the BLCR path this might be your problem.

A couple solutions:
- have the PATH and LD_LIBRARY_PATH set the same on all nodes
- have ompi-restart pass the -x parameter to the underlying mpirun by
using the -mpirun_opts command line switch:
   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...

Yes. ompi-restart will let you checkpoint a process on one node and
restart it on another. You will have to restart the whole application
since the ompi-migration operation is not available in the 1.5 series.

-- Josh

On Sat, Apr 21, 2012 at 4:11 AM, kidd <q19860...@yahoo.com.tw> wrote:
> Hi all,
> I have Some problems,I wana check/Restart Multiple process on 2 node.
>
>  My environment:
>  BLCR= 0.8.4   , openMPI= 1.5.5  , OS = ubuntu 11.04
> I have 2 Node :
>  N05(Master ,it have NFS shared file system),N07(slave
>  ,mount Master-Node).
>
>  My configure format=./configure --prefix=/root/kidd_openMPI
>  --with-ft=cr --enable-ft-thread  --with-blcr=/usr/local/BLCR
>  --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
>  --enable-static --enable-shared --enable-opal-multi-threads;
>
>   I had also set  ~/.openmpi/mca-params.conf->
>     crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>     snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints.
>
> the dir->kidd_openMPI is my nfs shared dir.
>
>  My Command :
>   1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
>
>   2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH
>      -np 2 ./TEST .
>
>   I can restart process-0 on Master,but process-1 on N07 was failed.
>
>   I checked my Node,it does not install the prelink,
>   so the error(restart-failed) is caused by other reasons.
>
>   Error Message-->
>  --
>   root@cuda05:~/kidd_openMPI/checkpoints#
>   ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
>  --
>     Error: BLCR was not able to restart the process because exec failed.
>      Check the installation of BLCR on all of the machines in your
>      system. The following information may be of help:
>   Return Code : -1
>   BLCR Restart Command : cr_restart
>   Restart Command Line : cr_restart
>  /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
>  opal_snapshot_1.ckpt/ompi_blcr_context.2704
>  --
>  --
>  Error: Unable to obtain the proper restart command to restart from the
>     checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>     Check the installation of the blcr checkpoint/restart service
>     on all of the machines in your system.
>  ###
>  problem 2: I wana let MPI-process can migration to another Node.
>          if Ompi-Restart  Multiple-Node can be successful.
>          Can restart in another new node, rather than the original node?
>                        example:
>          checkpoint (node1,node2,node3),then restart(node1,node3,node4).
>          or just restart

Re: [OMPI users] Ompi-restart failed and process migration

2012-04-23 Thread Josh Hursey
I wonder if the LD_LIBRARY_PATH is not being set properly upon
restart. In your mpirun you pass the '-x LD_LIBRARY_PATH'.
ompi-restart will not pass that variable along for you, so if you are
using that to set the BLCR path this might be your problem.

A couple solutions:
 - have the PATH and LD_LIBRARY_PATH set the same on all nodes
 - have ompi-restart pass the -x parameter to the underlying mpirun by
using the -mpirun_opts command line switch:
   ompi-restart --mpirun_opts "-x LD_LIBRARY_PATH" ...
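
The quotes in that command matter to the shell. The sketch below uses a stand-in function, count_args (a made-up name), instead of ompi-restart, to show how many separate arguments the shell passes in each form. How ompi-restart then interprets those arguments is not shown here.

```shell
#!/bin/sh
# count_args: print how many separate arguments the shell passes on.
# (Stand-in for ompi-restart, so nothing is actually restarted.)
count_args() {
    echo "$#"
}

# Quoted: the flag plus ONE value, the single string "-x LD_LIBRARY_PATH".
count_args --mpirun_opts "-x LD_LIBRARY_PATH"

# Unquoted: the flag plus TWO more words, so only "-x" directly follows
# --mpirun_opts and LD_LIBRARY_PATH arrives as a separate argument.
count_args --mpirun_opts -x LD_LIBRARY_PATH
```

With the quotes, the whole option string stays together as one argument following --mpirun_opts.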

Yes. ompi-restart will let you checkpoint a process on one node and
restart it on another. You will have to restart the whole application
since the ompi-migration operation is not available in the 1.5 series.
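
A restart on a different set of nodes might be prepared as in the sketch below. The node names come from the example in the question, RESTART_HOSTS is a hypothetical file name, and the snapshot name is the one from the error report. Whether the processes land on the nodes as hoped depends on how mpirun maps them, which has not been verified here.

```shell
#!/bin/sh
# Sketch: build a hostfile naming the nodes to restart on.
# node1/node3 are example names from the question; RESTART_HOSTS is a
# hypothetical file location for this example.
RESTART_HOSTS="${RESTART_HOSTS:-$(mktemp)}"

cat > "$RESTART_HOSTS" <<'EOF'
node1 slots=1
node3 slots=2
EOF

# Print the command rather than running it, since it needs a real checkpoint.
echo "ompi-restart -hostfile $RESTART_HOSTS ompi_global_snapshot_2892.ckpt"
```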

-- Josh

On Sat, Apr 21, 2012 at 4:11 AM, kidd  wrote:
> Hi all,
> I have Some problems,I wana check/Restart Multiple process on 2 node.
>
>  My environment:
>  BLCR= 0.8.4   , openMPI= 1.5.5  , OS = ubuntu 11.04
> I have 2 Node :
>  N05(Master ,it have NFS shared file system),N07(slave
>  ,mount Master-Node).
>
>  My configure format=./configure --prefix=/root/kidd_openMPI
>  --with-ft=cr --enable-ft-thread  --with-blcr=/usr/local/BLCR
>  --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default
>  --enable-static --enable-shared --enable-opal-multi-threads;
>
>   I had also set  ~/.openmpi/mca-params.conf->
>     crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>     snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints.
>
> the dir->kidd_openMPI is my nfs shared dir.
>
>  My Command :
>   1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c
>
>   2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH
>  -np 2 ./TEST .
>
>   I can restart process-0 on Master,but process-1 on N07 was failed.
>
>   I checked my Node,it does not install the prelink,
>   so the error(restart-failed) is caused by other reasons.
>
>   Error Message-->
>  --
>   root@cuda05:~/kidd_openMPI/checkpoints#
>   ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
>  --
>     Error: BLCR was not able to restart the process because exec failed.
>      Check the installation of BLCR on all of the machines in your
>      system. The following information may be of help:
>   Return Code : -1
>   BLCR Restart Command : cr_restart
>   Restart Command Line : cr_restart
>  /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
>  opal_snapshot_1.ckpt/ompi_blcr_context.2704
>  --
>  --
>  Error: Unable to obtain the proper restart command to restart from the
>     checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>     Check the installation of the blcr checkpoint/restart service
>     on all of the machines in your system.
>  ###
>  problem 2: I wana let MPI-process can migration to another Node.
>  if Ompi-Restart  Multiple-Node can be successful.
>  Can restart in another new node, rather than the original node?
>example:
>  checkpoint (node1,node2,node3),then restart(node1,node3,node4).
>  or just restart(node1,node3(2-process) ).
>
>    Please help me , thanks .
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey



[OMPI users] Ompi-restart failed and process migration

2012-04-21 Thread kidd
Hi all,
I have some problems. I want to checkpoint/restart multiple processes on
2 nodes.
My environment: BLCR = 0.8.4, Open MPI = 1.5.5, OS = Ubuntu 11.04
I have 2 nodes:
 N05 (master, which hosts the NFS shared file system) and N07 (slave,
 which mounts the master node).

 My configure command:
 ./configure --prefix=/root/kidd_openMPI \
   --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/BLCR \
   --with-blcr-libdir=/usr/local/BLCR/lib --enable-mpirun-prefix-by-default \
   --enable-static --enable-shared --enable-opal-multi-threads

I have also set the following in ~/.openmpi/mca-params.conf:
    crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
    snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints

The directory kidd_openMPI is my NFS shared directory.

 My commands:
 1. mpicc -o TEST -DDEFSIZE=3000 -DDEF_PROC=2 -fopenmp MPIMatrix.c

 2. mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH \
      -np 2 ./TEST

  I can restart process 0 on the master, but restarting process 1 on N07
  failed. I checked my nodes and prelink is not installed, so the restart
  failure must have another cause. Error message:
--  
root@cuda05:~/kidd_openMPI/checkpoints# 
ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
 --
    Error: BLCR was not able to restart the process because exec failed.
     Check the installation of BLCR on all of the machines in your
     system. The following information may be of help:
  Return Code : -1
  BLCR Restart Command : cr_restart
  Restart Command Line : cr_restart
 /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/
 opal_snapshot_1.ckpt/ompi_blcr_context.2704
 --
 --
 Error: Unable to obtain the proper restart command to restart from the
    checkpoint file (opal_snapshot_1.ckpt). Returned -1.
    Check the installation of the blcr checkpoint/restart service
    on all of the machines in your system.
 ###
Problem 2: I want MPI processes to be able to migrate to another node.
If ompi-restart across multiple nodes succeeds, can the processes restart
on a new node rather than on the original node?
Example:
checkpoint on (node1, node2, node3), then restart on (node1, node3, node4),
or just restart on (node1, node3 (2 processes)).
   Please help me, thanks.

Re: [OMPI users] ompi-restart failed && ompi-migrate

2012-04-17 Thread kidd
Hello, thank you for your reply, but I still cannot ompi-restart across
multiple nodes.
I checked my nodes (Ubuntu 11.04 and Open MPI 1.5.5); prelink is not
installed on them.
Could there be other reasons why ompi-restart fails?
PS:

If ompi-restart across multiple nodes succeeds,

can the job start on a new node rather than on the original node?
Example: checkpoint on (node1, node2), then restart on (node1, node3).



From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Sent: 2012/4/11 (Wed) 8:36 PM
Subject: Re: [OMPI users] ompi-restart failed && ompi-migrate
 
The 1.5 series does not support process migration, so there is no
ompi-migrate option there. This was only contributed to the trunk (1.7
series). However, changes to the runtime environment over the past few
months have broken this functionality. It is currently unclear when
this will be repaired. We hope to have it fixed and functional again
before the first release of the 1.7 series.

As far as your problem with ompi-restart have you checked the prelink
option on all of your nodes, per:
  https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

-- Josh

On Tue, Apr 10, 2012 at 11:14 PM, kidd <q19860...@yahoo.com.tw> wrote:
> Hello !
> I had some  problems .
> This is My environment
>    BLCR= 0.8.4   , openMPI= 1.5.5  , OS= ubuntu 11.04
>    I have 2 Node : cuda05(Master ,it have NFS  file system)  , cuda07(slave
> ,mount Master)
>
>    I had also set  ~/.openmpi/mca-params.conf->
>  crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>  snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>
>   my configure format=
> ./configure --prefix=/root/kidd_openMPI --with-ft=cr --enable-ft-thread
>  --with-blcr=/usr/local/BLCR  --with-blcr-libdir=/usr/local/BLCR/lib
> --enable-mpirun-prefix-by-default
>  --enable-static --enable-shared  --enable-opal-multi-threads;
>
> problem 1:  ompi-restart  on multiple Node
>   command 01: mpirun -hostfile  Hosts -am ft-enable-cr  -x  LD_LIBRARY_PATH
> -np 2  ./TEST
>   command 02: ompi-restart  ompi_global_snapshot_2892.ckpt
>   -> I can checkpoint 2 process on multiples nodes ,but when I restart
> ,it can only restart on Master-Node.
>
>      command 03 : ompi-restart  -hostfile Hosts
> ompi_global_snapshot_2892.ckpt
>     ->Error Message .   I make sure BLCR  is OK.
> 
>
> --
>     root@cuda05:~/kidd_openMPI/checkpoints# ompi-restart -hostfile Hosts
> ompi_global_snapshot_2892.ckpt/
>
> --
>    Error: BLCR was not able to restart the process because exec failed.
>     Check the installation of BLCR on all of the machines in your
>    system. The following information may be of help:
>  Return Code : -1
>  BLCR Restart Command : cr_restart
>  Restart Command Line : cr_restart
> /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.2704
> --
> --
> Error: Unable to obtain the proper restart command to restart from the
>    checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>    Check the installation of the blcr checkpoint/restart service
>    on all of the machines in your system.
> 
>  problem 2: ompi-migrate i can't find .   How to use ompi-migrate ?
>
>   Please help me , thanks .
>
>



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey


Re: [OMPI users] ompi-restart failed && ompi-migrate

2012-04-11 Thread kidd
Hello!
I checked my OS (Ubuntu 11); prelink is not installed. Are there other
possible reasons for the ompi-restart failure?
  Thanks.



From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Users <us...@open-mpi.org>
Sent: 2012/4/11 (Wed) 8:36 PM
Subject: Re: [OMPI users] ompi-restart failed && ompi-migrate
 
The 1.5 series does not support process migration, so there is no
ompi-migrate option there. This was only contributed to the trunk (1.7
series). However, changes to the runtime environment over the past few
months have broken this functionality. It is currently unclear when
this will be repaired. We hope to have it fixed and functional again
before the first release of the 1.7 series.

As far as your problem with ompi-restart have you checked the prelink
option on all of your nodes, per:
  https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

-- Josh


Re: [OMPI users] ompi-restart failed && ompi-migrate

2012-04-11 Thread Josh Hursey
The 1.5 series does not support process migration, so there is no
ompi-migrate option there. This was only contributed to the trunk (1.7
series). However, changes to the runtime environment over the past few
months have broken this functionality. It is currently unclear when
this will be repaired. We hope to have it fixed and functional again
before the first release of the 1.7 series.

As for your problem with ompi-restart, have you checked the prelink
option on all of your nodes, per:
  https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#prelink

-- Josh

On Tue, Apr 10, 2012 at 11:14 PM, kidd wrote:
> Hello!
> I have run into some problems.
> This is my environment:
>    BLCR 0.8.4, Open MPI 1.5.5, OS: Ubuntu 11.04
>    I have 2 nodes: cuda05 (master; it hosts the NFS file system) and
>    cuda07 (slave; it mounts the master's file system).
>
> I have also set the following in ~/.openmpi/mca-params.conf:
>   crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
>   snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints
>
> My configure command:
>   ./configure --prefix=/root/kidd_openMPI --with-ft=cr --enable-ft-thread \
>       --with-blcr=/usr/local/BLCR --with-blcr-libdir=/usr/local/BLCR/lib \
>       --enable-mpirun-prefix-by-default \
>       --enable-static --enable-shared --enable-opal-multi-threads
>
> Problem 1: ompi-restart on multiple nodes
>   Command 01: mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH -np 2 ./TEST
>   Command 02: ompi-restart ompi_global_snapshot_2892.ckpt
>   -> I can checkpoint 2 processes running on multiple nodes, but when I
>      restart, they restart only on the master node.
>
>   Command 03: ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt
>   -> This fails with the error message below. I have made sure that BLCR is OK.
>
> --
>   root@cuda05:~/kidd_openMPI/checkpoints# ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
> --
>   Error: BLCR was not able to restart the process because exec failed.
>    Check the installation of BLCR on all of the machines in your
>    system. The following information may be of help:
>    Return Code : -1
>    BLCR Restart Command : cr_restart
>    Restart Command Line : cr_restart /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.2704
> --
> --
>   Error: Unable to obtain the proper restart command to restart from the
>    checkpoint file (opal_snapshot_1.ckpt). Returned -1.
>    Check the installation of the blcr checkpoint/restart service
>    on all of the machines in your system.
>
> Problem 2: I cannot find ompi-migrate. How do I use ompi-migrate?
>
> Please help me. Thanks.



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
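
One way to answer the prelink question for every node at once is to ask each host in the hostfile whether a prelink binary is on its PATH. The sketch below assumes passwordless ssh to each node and a hostfile whose first field on each line is the host name (such as the Hosts file used in this thread). list_hosts and check_prelink are illustrative helper names, not Open MPI or BLCR tools. It only reports whether prelink is installed, not whether libraries on that node have already been prelinked.

```shell
#!/bin/sh
# Sketch: report whether prelink is installed on each node in a hostfile.
# Assumptions: passwordless ssh works, and the host name is the first
# field of every non-empty, non-comment hostfile line.

# Print the host name from each usable hostfile line.
list_hosts() {
    awk 'NF && $1 !~ /^#/ { print $1 }' "$1"
}

# Ask every host whether a prelink binary is on its PATH.
check_prelink() {
    for host in $(list_hosts "$1"); do
        ssh -o BatchMode=yes "$host" 'command -v prelink >/dev/null 2>&1'
        status=$?
        case $status in
            0)   echo "$host: prelink installed" ;;
            255) echo "$host: could not connect over ssh" ;;
            *)   echo "$host: prelink not found" ;;
        esac
    done
}

# Usage: check_prelink Hosts
```

If prelink does turn up on a node, the BLCR FAQ entry linked above is the place to look for what to do about it.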



[OMPI users] ompi-restart failed && ompi-migrate

2012-04-11 Thread kidd
Hello!
I have run into some problems.
This is my environment:
   BLCR 0.8.4, Open MPI 1.5.5, OS: Ubuntu 11.04
   I have 2 nodes: cuda05 (master; it hosts the NFS file system) and
   cuda07 (slave; it mounts the master's file system).

I have also set the following in ~/.openmpi/mca-params.conf:
  crs_base_snapshot_dir=/root/kidd_openMPI/Tmp
  snapc_base_global_snapshot_dir=/root/kidd_openMPI/checkpoints

My configure command:
  ./configure --prefix=/root/kidd_openMPI --with-ft=cr --enable-ft-thread \
      --with-blcr=/usr/local/BLCR --with-blcr-libdir=/usr/local/BLCR/lib \
      --enable-mpirun-prefix-by-default \
      --enable-static --enable-shared --enable-opal-multi-threads

Problem 1: ompi-restart on multiple nodes
  Command 01: mpirun -hostfile Hosts -am ft-enable-cr -x LD_LIBRARY_PATH -np 2 ./TEST
  Command 02: ompi-restart ompi_global_snapshot_2892.ckpt
  -> I can checkpoint 2 processes running on multiple nodes, but when I
     restart, they restart only on the master node.

  Command 03: ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt
  -> This fails with the error message below. I have made sure that BLCR is OK.

--
  root@cuda05:~/kidd_openMPI/checkpoints# ompi-restart -hostfile Hosts ompi_global_snapshot_2892.ckpt/
--
  Error: BLCR was not able to restart the process because exec failed.
   Check the installation of BLCR on all of the machines in your
   system. The following information may be of help:
   Return Code : -1
   BLCR Restart Command : cr_restart
   Restart Command Line : cr_restart /root/kidd_openMPI/checkpoints/ompi_global_snapshot_2892.ckpt/0/opal_snapshot_1.ckpt/ompi_blcr_context.2704
--
--
  Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_1.ckpt). Returned -1.
   Check the installation of the blcr checkpoint/restart service
   on all of the machines in your system.

Problem 2: I cannot find ompi-migrate. How do I use ompi-migrate?

Please help me. Thanks.
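
For the mca-params.conf settings in this message, Josh's later reply (at the top of this archive) suggests that when the snapshot directory is NFS-mounted, only snapc_base_global_snapshot_dir needs to be set, and that the file should contain the same information on all nodes. A minimal sketch of writing that one-line file follows. write_nfs_mca_params is a hypothetical helper, not part of Open MPI, and it overwrites any existing mca-params.conf in the target directory, so back up an existing file first.

```shell
#!/bin/sh
# Sketch: write a minimal mca-params.conf for the NFS-only checkpoint setup.
# write_nfs_mca_params is a hypothetical helper name.
# WARNING: it overwrites any existing mca-params.conf in conf_dir.

write_nfs_mca_params() {
    conf_dir=$1        # normally "$HOME/.openmpi"
    snapshot_dir=$2    # NFS-mounted directory for global snapshots
    mkdir -p "$conf_dir" || return 1
    printf 'snapc_base_global_snapshot_dir=%s\n' "$snapshot_dir" \
        > "$conf_dir/mca-params.conf"
}

# Usage (repeat on every node unless $HOME itself is shared):
#   write_nfs_mca_params "$HOME/.openmpi" /root/kidd_openMPI/checkpoints
```

Starting from this single setting keeps the number of variables small. The staged setup with snapc_base_store_in_place=0 and a separate crs_base_snapshot_dir can be added once the basic case works.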

Re: [OMPI users] ompi-restart failed

2010-07-16 Thread Josh Hursey
Open MPI can restart multi-threaded applications on any number of nodes (I do 
this routinely in testing).

If you are still experiencing this problem (sorry for the late reply), can you 
send me the MCA parameters that you are using, command line, and a backtrace 
from the corefile generated by the application?

Those bits of information will help me narrow down what might be going wrong. 
You might also try testing against the v1.5 series or the development trunk to 
make sure that the problem is not just v1.4 specific.

-- Josh

On Jun 14, 2010, at 2:47 AM, Nguyen Toan wrote:

> Hi all,
> I finally figured out the answer. I just added the parameter "-machinefile 
> host" to the "ompi-restart" command and it restarted correctly. So is Open MPI 
> unable to restart a multi-threaded application on 1 node?
> 
> Nguyen Toan 




Re: [OMPI users] ompi-restart failed

2010-06-14 Thread Nguyen Toan
Hi all,
I finally figured out the answer. I just added the parameter "-machinefile
host" to the "ompi-restart" command and it restarted correctly. So is Open MPI
unable to restart a multi-threaded application on 1 node?

Nguyen Toan

On Tue, Jun 8, 2010 at 12:07 AM, Nguyen Toan wrote:

> Sorry, I just want to add 2 more things:
> + I tried configuring with and without --enable-ft-thread, but nothing changed.
> + I also applied this patch to Open MPI and reinstalled, but I got the
> same error:
>
> https://svn.open-mpi.org/trac/ompi/raw-attachment/ticket/2139/v1.4-preload-part1.diff
>
> Can somebody help? Thank you very much.
>
> Nguyen Toan


Re: [OMPI users] ompi-restart failed

2010-06-07 Thread Nguyen Toan
Sorry, I just want to add 2 more things:
+ I tried configuring with and without --enable-ft-thread, but nothing changed.
+ I also applied this patch to Open MPI and reinstalled, but I got the
same error:
https://svn.open-mpi.org/trac/ompi/raw-attachment/ticket/2139/v1.4-preload-part1.diff

Can somebody help? Thank you very much.

Nguyen Toan

On Mon, Jun 7, 2010 at 11:51 PM, Nguyen Toan wrote:

> Hello everyone,
>
> I'm using Open MPI 1.4.2 with BLCR 0.8.2 to test checkpointing on 2 nodes,
> but the restart fails with a segmentation fault.
> Here are the details of my setup:
>
> + OS: CentOS 5.4
> + Open MPI configure:
> ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
> --with-blcr=/home/nguyen/opt/blcr \
> --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> --prefix=/home/nguyen/opt/openmpi \
> --enable-mpirun-prefix-by-default
> + mpirun -am ft-enable-cr -machinefile host ./test
>
> I checkpointed the test program using "ompi-checkpoint -v -s PID", and the
> checkpoint file was created successfully. However, ompi-restart failed to
> restart it:
> "mpirun noticed that process rank 0 with PID 21242 on node rc014.local
> exited on signal 11 (Segmentation fault)"
>
> Did I miss something in the installation of Open MPI?
>
> Regards,
> Nguyen Toan
>


[OMPI users] ompi-restart failed

2010-06-07 Thread Nguyen Toan
Hello everyone,

I'm using Open MPI 1.4.2 with BLCR 0.8.2 to test checkpointing on 2 nodes, but
the restart fails with a segmentation fault.
Here are the details of my setup:

+ OS: CentOS 5.4
+ Open MPI configure:
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
--with-blcr=/home/nguyen/opt/blcr \
--with-blcr-libdir=/home/nguyen/opt/blcr/lib \
--prefix=/home/nguyen/opt/openmpi \
--enable-mpirun-prefix-by-default
+ mpirun -am ft-enable-cr -machinefile host ./test

I checkpointed the test program using "ompi-checkpoint -v -s PID", and the
checkpoint file was created successfully. However, ompi-restart failed to
restart it:
"mpirun noticed that process rank 0 with PID 21242 on node rc014.local
exited on signal 11 (Segmentation fault)"

Did I miss something in the installation of Open MPI?

Regards,
Nguyen Toan
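
Putting the commands from this thread together, the checkpoint/restart cycle and the fix reported in the follow-up (passing the same machinefile to ompi-restart) might look like the dry-run sketch below. It only prints the commands rather than running them. cr_commands is an illustrative helper name, the arguments are placeholders, and the snapshot directory name is assumed to follow the ompi_global_snapshot_<PID>.ckpt pattern seen elsewhere in this archive, with PID taken to be that of mpirun.

```shell
#!/bin/sh
# Dry-run sketch: print the checkpoint/restart commands discussed in this
# thread instead of executing them. All arguments are placeholders.

cr_commands() {
    machinefile=$1   # e.g. "host", as used in this thread
    program=$2       # e.g. "./test"
    mpirun_pid=$3    # PID of the running mpirun process (assumption)

    # 1. Launch the job with checkpoint/restart support enabled.
    echo "mpirun -am ft-enable-cr -machinefile $machinefile $program"
    # 2. Checkpoint the running job.
    echo "ompi-checkpoint -v -s $mpirun_pid"
    # 3. Restart, passing the machinefile again (the fix reported above).
    #    Assumption: the snapshot is named after the mpirun PID.
    echo "ompi-restart -machinefile $machinefile ompi_global_snapshot_${mpirun_pid}.ckpt"
}

# Usage: cr_commands host ./test 12345
```

Printing first makes it easy to confirm the machinefile and snapshot name before running anything on the cluster.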