[OMPI users] Fw: Problem with checkpointing multihosts, multiprocesses MPI application

2010-01-02 Thread Kritiraj Sajadah
Hi Everyone,
  Happy New Year 2010. A few weeks ago I posted a query (please see the 
email below) regarding checkpointing applications running on multiple hosts. I 
am still struggling to find a solution. I would really appreciate it if someone 
could help me.

Thank you.

Raj




--- On Sat, 12/12/09, Kritiraj Sajadah  wrote:

> From: Kritiraj Sajadah 
> Subject: Problem with checkpointing multihosts, multiprocesses MPI application
> To: us...@open-mpi.org
> Date: Saturday, December 12, 2009, 3:03 PM
> Dear All,
>          I am trying to
> checkpoint an MPI application which has two processes, each
> running on a separate host.
> 
> I run the application as follows:
> 
> raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile
> sunhost -mca btl ^openib -mca snapc_base_global_snapshot_dir
> /tmp m.
> 
> and I trigger the checkpoint as follows:
> 
> raj@sun32:~$ ompi-checkpoint -v 30010
> 
> 
> The following happens, displaying two errors while
> checkpointing the application:
> 
> 
> ##
> I am processor no 0 of a total of 2 procs on host sun32
> I am processor no 1 of a total of 2 procs on host sun06
> I am processo no 0 of a total of 2 procs on host
> sun32 
> I am processo no 1 of a total of 2 procs on host
> sun06 
> 
> [sun32:30010] Error: expected_component: PID information
> unavailable!
> [sun32:30010] Error: expected_component: Component Name
> information unavailable!
> 
> I am proceor no 1 of a total of 2 procs on host
> sun06
> I am proceor no 0 of a total of 2 procs on host
> sun32
> bye 
> bye 
> 
> 
> 
> 
> 
> When I try to restart the application from the checkpoint
> file, I get the following:
> 
> raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
> --
> Error: The filename (opal_snapshot_1.ckpt) is invalid
> because either you have not provided a filename
>        or provided an invalid
> filename.
>        Please see --help for
> usage.
> 
> --
> I am proceor no 0 of a total of 2 procs on host
> sun32
> bye 
> 
> 
> I would very much appreciate it if you could give me some ideas on
> how to checkpoint and restart an MPI application running on
> multiple hosts.
> 
> Thank you
> 
> Regards,
> 
> Raj
> 
> 
>       
> 






[OMPI users] problem restarting multiprocess mpi application

2009-12-13 Thread Kritiraj Sajadah
Dear All,
I am running a simple mpi application which looks as follows:

##

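/* The header names were stripped from the archive; these are the ones the
   code below needs. */
#include <mpi.h>     /* MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize */
#include <stdio.h>   /* printf */
#include <unistd.h>  /* sleep */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello\n");
    sleep(15);
    printf("Hello again\n");
    sleep(15);
    printf("Final Hello\n");
    sleep(15);
    printf("bye \n");
    MPI_Finalize();
    return 0;
}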
#

When I run my application, it checkpoints correctly, but when I try to 
restart it, it gives the following errors:

##

ompi-restart ompi_global_snapshot_380.ckpt
Hello again
[sun06:00381] *** Process received signal ***
[sun06:00381] Signal: Bus error (7)
[sun06:00381] Signal code:  (2)
[sun06:00381] Failing at address: 0xae7cb054
[sun06:00381] [ 0] [0xb7f8640c]
[sun06:00381] [ 1] 
/home/raj/openmpisof/lib/libopen-pal.so.0(opal_progress+0x123) [0xb7b95456]
[sun06:00381] [ 2] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcb093]
[sun06:00381] [ 3] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7bcae97]
[sun06:00381] [ 4] 
/home/raj/openmpisof/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x187) 
[0xb7bca69b]
[sun06:00381] [ 5] 
/home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_inc_core+0xc3) [0xb7b970bd]
[sun06:00381] [ 6] /home/raj/openmpisof/lib/libopen-rte.so.0 [0xb7cab06f]
[sun06:00381] [ 7] 
/home/raj/openmpisof/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x129)
 [0xb7b96fca]
[sun06:00381] [ 8] /home/raj/openmpisof/lib/libopen-pal.so.0 [0xb7b97698]
[sun06:00381] [ 9] /lib/libpthread.so.0 [0xb7ac4f3b]
[sun06:00381] [10] /lib/libc.so.6(clone+0x5e) [0xb7a4bbee]
[sun06:00381] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 399 on node sun06 exited on signal 
7 (Bus error).
--
#

I am running it as follows:


mpirun -am ft-enable-cr -np 2 -mca btl ^openib -mca 
snapc_base_global_snapshot_dir /tmp mpisleepbas.


Once a checkpoint is taken, I have to copy it to the home directory and try to 
restart it.

Please note that if I use -np 1, it works fine when I restart it. The problem 
occurs mainly when the application has more than one process running.


Any help would be much appreciated.


Raj








[OMPI users] Problem with checkpointing multihosts, multiprocesses MPI application

2009-12-12 Thread Kritiraj Sajadah
Dear All,
 I am trying to checkpoint an MPI application which has two processes, 
each running on a separate host.

I run the application as follows:

raj@sun32:~$ mpirun -am ft-enable-cr -np 2 --hostfile sunhost -mca btl ^openib 
-mca snapc_base_global_snapshot_dir /tmp m.

and I trigger the checkpoint as follows:

raj@sun32:~$ ompi-checkpoint -v 30010


The following happens, displaying two errors while checkpointing the application:


##
I am processor no 0 of a total of 2 procs on host sun32
I am processor no 1 of a total of 2 procs on host sun06
I am processo no 0 of a total of 2 procs on host sun32 
I am processo no 1 of a total of 2 procs on host sun06 

[sun32:30010] Error: expected_component: PID information unavailable!
[sun32:30010] Error: expected_component: Component Name information unavailable!

I am proceor no 1 of a total of 2 procs on host sun06
I am proceor no 0 of a total of 2 procs on host sun32
bye 
bye 





When I try to restart the application from the checkpoint file, I get the 
following:

raj@sun32:~$ ompi-restart ompi_global_snapshot_30010.ckpt
--
Error: The filename (opal_snapshot_1.ckpt) is invalid because either you have 
not provided a filename
   or provided an invalid filename.
   Please see --help for usage.

--
I am proceor no 0 of a total of 2 procs on host sun32
bye 


I would very much appreciate it if you could give me some ideas on how to 
checkpoint and restart an MPI application running on multiple hosts.

Thank you

Regards,

Raj
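
One way to narrow down the ompi-restart error above is to check whether a
local snapshot for every rank actually reached the global snapshot on the node
where ompi-restart runs. The sketch below is a diagnostic only. It rests on an
assumption taken from the error message: that each rank's local snapshot is
stored somewhere beneath the global snapshot directory under a name like
opal_snapshot_<rank>.ckpt. The function name has_all_ranks and its arguments
are hypothetical helpers, not part of Open MPI.

```shell
# has_all_ranks GLOBAL_SNAPSHOT_DIR NP
# Succeeds only if an entry named opal_snapshot_<rank>.ckpt is found somewhere
# under GLOBAL_SNAPSHOT_DIR for every rank 0..NP-1 (assumed naming, see above).
has_all_ranks() {
    dir=$1
    np=$2
    rank=0
    missing=0
    while [ "$rank" -lt "$np" ]; do
        # An unreadable or missing directory simply counts as "not found".
        found=$(find "$dir" -name "opal_snapshot_$rank.ckpt" -print 2>/dev/null) || found=""
        if [ -z "$found" ]; then
            echo "missing local snapshot for rank $rank" >&2
            missing=1
        fi
        rank=$((rank + 1))
    done
    return "$missing"
}
```

For the run above this would be called as
has_all_ranks /tmp/ompi_global_snapshot_30010.ckpt 2 on sun32. If rank 1 is
reported missing, its snapshot may still be sitting in /tmp on sun06, since
/tmp is usually local to each host.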





[OMPI users] a good grid simulator to run open MPI applications

2009-12-06 Thread Kritiraj Sajadah
Hi All,
Can you recommend a good open source grid simulation tool for running 
Open MPI applications?

Thanks

Raj





[OMPI users] get the process Id of mpirun

2009-11-14 Thread Kritiraj Sajadah
Dear All,
  I am trying to get the process ID of mpirun from within my MPI 
application. When I use getpid() and getppid(), I get the PID of my application 
and the PID of "orted --daemonize -mca..." respectively. 
Is there a way to get the PID of mpirun? In this case, it looks like it is 
the grandparent of the application.

Thank you 

Regards,

Raj
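
The parent/grandparent lookup described above can be illustrated on Linux with
/proc. The sketch is in shell for brevity; a C program could read the same
PPid: field from /proc/<pid>/status. The names parent_pid_of and
grandparent_pid_of are hypothetical helpers. The sketch assumes the process
tree really is mpirun -> orted -> application. If orted has daemonized, it may
have been re-parented to init (PID 1), and when the application runs on a
remote host, mpirun is not in the local process tree at all, so the result
should be checked before it is trusted.

```shell
# parent_pid_of PID: print the parent PID recorded in /proc/PID/status
# (Linux only).
parent_pid_of() {
    [ -r "/proc/$1/status" ] || return 1
    while read -r key value rest; do
        if [ "$key" = "PPid:" ]; then
            echo "$value"
            return 0
        fi
    done < "/proc/$1/status"
    return 1
}

# grandparent_pid_of PID: parent of the parent, i.e. mpirun only if the tree
# is mpirun -> orted -> application (an assumption, see above).
grandparent_pid_of() {
    parent=$(parent_pid_of "$1") || return 1
    parent_pid_of "$parent"
}
```

Calling grandparent_pid_of with the application's PID and then inspecting
/proc/<result>/cmdline shows whether the process found is really mpirun.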





[OMPI users] mpirun noticed that process rank 1 ... exited on signal 13 (Broken pipe).

2009-11-06 Thread Kritiraj Sajadah
Hi Everyone,
  I have installed openmpi 1.3 and blcr 0.8.1 on my laptop (single 
processor).

I am trying to checkpoint a small test application:

###

/* The header names were stripped from the archive; these are the ones the
   code below needs. */
#include <mpi.h>     /* MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize */
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* system */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 10");
    printf("mpisleep bye \n");
    MPI_Finalize();
    return 0;
}
###

I compile it as follows:

mpicc mpisleep.c -o mpisleep

and I run it as follows:

mpirun -am ft-enable-cr -np 2 mpisleep.

When I checkpoint it (ompi-checkpoint -v 8118), it checkpoints fine, but 
when I restart it, I get the following:

I am processor no 0 of a total of 2 procs 
I am processor no 1 of a total of 2 procs 
mpisleep bye 
--
mpirun noticed that process rank 1 with PID 8118 on node raj-laptop exited on 
signal 13 (Broken pipe).
--

Any suggestions are very much appreciated.

Raj





[OMPI users] problem using openmpi with DMTCP

2009-09-28 Thread Kritiraj Sajadah
Dear All, 
  I am trying to integrate DMTCP with openmpi. If I run a plain C 
application, it works fine. But when I execute the program using mpirun, it 
checkpoints the application but gives errors when restarting it.

#
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING((_sockDomain 
== AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
 id() = 2ab3f248-30933-4ac0d75a(99007)
 _sockDomain = 10
 _sockType = 1
 _sockProtocol = 0
Message: socket type not yet [fully] supported
[31007] WARNING at connection.cpp:303 in restore; REASON='JWARNING((_sockDomain 
== AF_INET || _sockDomain == AF_UNIX ) && _sockType == SOCK_STREAM) failed'
 id() = 2ab3f248-30943-4ac0d75c(99007)
 _sockDomain = 10
 _sockType = 1
 _sockProtocol = 0
Message: socket type not yet [fully] supported
[31013] WARNING at connection.cpp:87 in restartDup2; 
REASON='JWARNING(_real_dup2 ( oldFd, fd ) == fd) failed'
 oldFd = 537
 fd = 1
 (strerror((*__errno_location ( = Bad file descriptor
[31013] WARNING at connectionmanager.cpp:627 in closeAll; 
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
 i->second = 537
 (strerror((*__errno_location ( = Bad file descriptor
[31015] WARNING at connectionmanager.cpp:627 in closeAll; 
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
 i->second = 537
 (strerror((*__errno_location ( = Bad file descriptor
[31017] WARNING at connectionmanager.cpp:627 in closeAll; 
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
 i->second = 537
 (strerror((*__errno_location ( = Bad file descriptor
[31007] WARNING at connectionmanager.cpp:627 in closeAll; 
REASON='JWARNING(_real_close ( i->second ) ==0) failed'
 i->second = 537
 (strerror((*__errno_location ( = Bad file descriptor
MTCP: mtcp_restart_nolibc: mapping current version of 
/usr/lib/gconv/gconv-modules.cache into memory;
  _not_ file as it existed at time of checkpoint.
  Change mtcp_restart_nolibc.c:634 and re-compile, if you want different 
behavior.
[31015] ERROR at connection.cpp:372 in restoreOptions; REASON='JASSERT(ret == 
0) failed'
 (strerror((*__errno_location ( = Invalid argument
 fds[0] = 6
 opt->first = 26
 opt->second.size() = 4
Message: restoring setsockopt failed
Terminating...
#

Any suggestions are very welcome.

regards,

Raj





Re: [OMPI users] configure OPENMPI with DMTCP

2009-08-13 Thread Kritiraj Sajadah
Hi Josh,
  I can't access the link you gave. It's a secure link and I think it needs 
authentication. 

Thanks

Raj

--- On Thu, 8/13/09, Josh Hursey  wrote:

> From: Josh Hursey 
> Subject: Re: [OMPI users] configure OPENMPI with DMTCP
> To: "Open MPI Users" 
> Date: Thursday, August 13, 2009, 2:40 PM
> 
> On Aug 12, 2009, at 3:35 PM, Kritiraj Sajadah wrote:
> 
> > Hi,
> >   I want to configure OPENMPI to checkpoint MPI applications using
> > DMTCP. Does anyone know how to specify the path to the DMTCP
> > application when installing OPENMPI?
> 
> I have not experimented with Open MPI using DMTCP. If I
> understand their website and papers correctly, DMTCP can
> work with Open MPI without modification (though I do not
> know to what degree of coverage), so you -should- not need
> to specify anything when building Open MPI.
> 
> > 
> > Also, I wanted to use OPENMPI with SELF instead of BLCR. Is there
> > any guide for setting up OPENMPI with SELF?
> 
> There are instructions for this in the Checkpoint/Restart
> User's Guide posted to the Open MPI wiki:
>   https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
> 
> -- Josh
> 
> > 
> > Thanks a lot.
> > 
> > Regards,
> > 
> > Raj
> > 
> > 
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 






[OMPI users] configure OPENMPI with DMTCP

2009-08-12 Thread Kritiraj Sajadah
Hi,
   I want to configure OPENMPI to checkpoint MPI applications using DMTCP. Does 
anyone know how to specify the path to the DMTCP application when installing 
OPENMPI?

Also, I wanted to use OPENMPI with SELF instead of BLCR. Is there any guide for 
setting up OPENMPI with SELF?

Thanks a lot.

Regards,

Raj





Re: [OMPI users] Checkpointing automatically at regular intervals

2009-06-30 Thread Kritiraj Sajadah

Dear Josh,
I am sure it would definitely be useful, because someone using 
OPEN MPI to checkpoint an application will not want to sit and 
checkpoint it manually; that can be a real pain if it is a long-running 
application.

I would imagine an automatic restart from the last checkpoint in case of 
failure would also be interesting.

Many thanks.

Regards,

Kritiraj

--- On Tue, 6/30/09, Josh Hursey  wrote:

> From: Josh Hursey 
> Subject: Re: [OMPI users] Checkpointing automatically at regular intervals
> To: "Open MPI Users" 
> Date: Tuesday, June 30, 2009, 3:00 PM
> Currently, there is no mechanism to
> checkpoint every X minutes in Open MPI.
> 
> As mentioned below you can use a script to initiate the
> checkpoint every X minutes. Alternatively it should not be
> too difficult to add such a feature to Open MPI. If enough
> people would be interested I can file a feature bug to add
> such a feature in a future release.
> 
> Josh
> 
> On Jun 30, 2009, at 9:34 AM, Mohamed Slim bouguerra wrote:
> 
> > Hi,
> > I think that you can write a simple script such as:
> > 
> > while pgrep mpirun > /dev/null; do
> >     ompi-checkpoint "$(pidof mpirun)"
> >     sleep 5
> > done
> > 
> > Le 30 juin 09 à 14:29, Kritiraj Sajadah a écrit :
> > 
> >> 
> >> Dear All,
> >>        I can manually checkpoint an MPI application using OPEN MPI
> >> and BLCR. However, I now want to checkpoint my application
> >> automatically every 5 minutes. Is there a way in OPEN MPI to ensure
> >> automatic checkpointing without user intervention while the
> >> application is running?
> >> 
> >> Thank you
> >> 
> >> Regards,
> >> Kritiraj
> >> 
> >> 
> >> 
> >> 
> > 
> > 
> 
> 
> 
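
The loop suggested above can be written a little more defensively. The sketch
below assumes ompi-checkpoint accepts the PID of mpirun as its argument, which
is how it is invoked elsewhere on this list. periodic_checkpoint, CKPT_CMD and
the optional limit argument are hypothetical names introduced for
illustration. This is a workaround script, not an Open MPI feature.

```shell
# Command used to take one checkpoint; overridable for dry runs (hypothetical hook).
CKPT_CMD=${CKPT_CMD:-ompi-checkpoint}

# periodic_checkpoint PID INTERVAL_SECONDS [MAX_CHECKPOINTS]
# Checkpoints the job whose mpirun has the given PID every INTERVAL_SECONDS
# until that process exits, or until MAX_CHECKPOINTS were taken (0 = no limit).
periodic_checkpoint() {
    pid=$1
    interval=$2
    max=${3:-0}
    taken=0
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # The job may have finished while we were sleeping.
        kill -0 "$pid" 2>/dev/null || break
        "$CKPT_CMD" "$pid" || echo "checkpoint of $pid failed" >&2
        taken=$((taken + 1))
        if [ "$max" -gt 0 ] && [ "$taken" -ge "$max" ]; then
            break
        fi
    done
}
```

For example, periodic_checkpoint "$(pgrep -o -x mpirun)" 300 would request a
checkpoint every 300 seconds (5 minutes) for the oldest running mpirun. Unlike
the original loop it targets one specific PID, so it does not misbehave when
several mpirun processes are running.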






Re: [OMPI users] Apllication level checkpointing tools.

2009-06-30 Thread Kritiraj Sajadah

Dear Mohamed,
 Thank you for the link.

Regards,

Raj

--- On Tue, 6/30/09, Mohamed Slim bouguerra wrote:

> From: Mohamed Slim bouguerra 
> Subject: Re: [OMPI users] Apllication level checkpointing tools.
> To: "Open MPI Users" 
> Date: Tuesday, June 30, 2009, 1:09 PM
> Dear Kritiraj,
> You can use DMTCP  http://sourceforge.net/projects/dmtcp
> 
> Le 30 juin 09 à 13:59, Kritiraj Sajadah a écrit :
> 
> > 
> > Dear All,
> >          I have successfully configured OPENMPI with BLCR and did
> > some tests. However, I now want to do some testing with an
> > application-level checkpointing tool. I tried using libckpt but
> > could not install it.
> > 
> > Does any of you know of any open source application-level
> > checkpointing tools that I can install and test with openmpi?
> > 
> > Thank you
> > 
> > Regards,
> > 
> > Raj
> > 
> > 
> > 
> > 
> 
> 
> 






[OMPI users] Checkpointing automatically at regular intervals

2009-06-30 Thread Kritiraj Sajadah

Dear All, 
 I can manually checkpoint an MPI application using OPEN MPI and BLCR. 
However, I now want to checkpoint my application automatically every 5 
minutes. Is there a way in OPEN MPI to ensure automatic checkpointing without 
user intervention while the application is running?

Thank you

Regards,
Kritiraj





[OMPI users] Apllication level checkpointing tools.

2009-06-30 Thread Kritiraj Sajadah

Dear All,
  I have successfully configured OPENMPI with BLCR and did some tests. 
However, I now want to do some testing with an application-level checkpointing 
tool. I tried using libckpt but could not install it. 

Does any of you know of any open source application-level checkpointing tools 
that I can install and test with openmpi?

Thank you

Regards,

Raj





Re: [OMPI users] vfs_write returned -14

2009-06-20 Thread Kritiraj Sajadah

Hi Josh,
  Thank you for the email. I can now checkpoint the application on the 
cluster using OPEN MPI. But I am now facing another problem.

When I tried restarting from the checkpoint, nothing happened. I copied the 
checkpoint file to the $HOME directory, tried restarting it there, and got 
the following error:

- open('/var/cache/nscd/passwd', 0x0) failed: -13
- mmap failed: /var/cache/nscd/passwd
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
- thaw_threads returned error, aborting. -13
Restart failed: Permission denied

On my laptop it works fine. So I am assuming it again has something to do with 
my $HOME directory.

Is it possible to restart the checkpoint from the /tmp directory itself without 
having to copy it back to the $HOME directory?

Is there another way to compile and build openmpi so that everything happens in 
the /tmp directory instead of the $HOME directory?

Thank you

Raj
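
Josh's reply quoted below addresses the snapshot location with the MCA
parameter snapc_base_global_snapshot_dir, set either on the mpirun command
line or in ~/.openmpi/mca-params.conf. A minimal sketch for maintaining that
file from a script follows. set_mca_param is a hypothetical helper, and it
assumes the file holds one name=value entry per line, as in Josh's example.

```shell
# set_mca_param FILE NAME VALUE
# Ensure FILE contains exactly one "NAME=VALUE" line, replacing any earlier
# assignment of NAME and leaving all other lines untouched.
set_mca_param() {
    file=$1
    name=$2
    value=$3
    mkdir -p "$(dirname "$file")" || return 1
    touch "$file" || return 1
    tmp=$(mktemp) || return 1
    # Copy every line except an existing assignment of NAME
    # (grep exits non-zero when nothing is left, which is fine here) ...
    grep -v "^[[:space:]]*$name[[:space:]]*=" "$file" > "$tmp" || true
    # ... then append the new assignment.
    printf '%s=%s\n' "$name" "$value" >> "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

A call such as
set_mca_param "$HOME/.openmpi/mca-params.conf" snapc_base_global_snapshot_dir /tmp
would make /tmp the global snapshot directory for later mpirun invocations.
Whether ompi-restart then finds the snapshot without copying it to $HOME is
something to verify on the cluster, not something this sketch guarantees.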

--- On Fri, 6/19/09, Josh Hursey  wrote:

> From: Josh Hursey 
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users" 
> Date: Friday, June 19, 2009, 2:48 PM
> 
> On Jun 18, 2009, at 7:33 PM, Kritiraj Sajadah wrote:
> 
> >
> > Hello Josh,
> >           Thank you again for your response. I tried checkpointing a
> > simple C program using BLCR and got the same error, i.e.:
> >
> > - vfs_write returned -14
> > - file_header: write returned -14
> > Checkpoint failed: Bad address
> 
> So I would look at how your NFS file system is setup, and
> work with  
> your sysadmin (and maybe the BLCR list) to resolve this
> before  
> experimenting too much with checkpointing with Open MPI.
> 
> >
> > This is how I installed and ran MPI programs for checkpointing:
> >
> > 1) configure and install blcr
> > 2) configure and install openmpi
> > 3)  Compile and run mpi program as follows:
> > 4) To checkpoint the running program,
> > 5) To restart your checkpoint, locate the checkpoint file and type
> > the following from the command line:
> >
> 
> This all looks ok to me.
> 
> > Then I did another test with BLCR, however.
> >
> > I tried checkpointing my C application from the /tmp directory
> > instead of my $HOME directory and it checkpointed fine.
> >
> > So, it looks like the problem is with my $HOME directory.
> >
> > I have "drwx" rights on my $HOME directory, which seems fine to me.
> >
> > Then I tried it with Open MPI. However, with Open MPI the
> > checkpoint file automatically gets saved in the $HOME directory.
> >
> > Is there a way to have the file saved in a different location? I
> > checked that LAM/MPI has some command line options:
> >
> > $ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out
> >
> > Do we have a similar option for Open MPI?
> 
> By default Open MPI places the global snapshot in the $HOME
> directory.  
> But you can also specify a different directory for the
> global snapshot  
> using the following MCA option:
>    -mca snapc_base_global_snapshot_dir /somewhere/else
> 
> For the best results you will likely want to set this in
> the MCA  
> params file in your home directory:
>   shell$ cat ~/.openmpi/mca-params.conf
>   snapc_base_global_snapshot_dir=/somewhere/else
> 
> You can also stage the file to local disk, then have Open
> MPI transfer  
> the checkpoints back to a {logically} central storage
> device (both can  
> be /tmp on a local disk if you like). For more details on
> this and the  
> above option you will want to read through the FT Users
> Guide attached  
> to the wiki page at the link below:
>    https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
> 
> -- Josh
> 
> >
> >
> > Thanks a lot
> >
> > regards,
> >
> > Raj
> >
> > --- On Wed, 6/17/09, Josh Hursey wrote:
> >
> >> From: Josh Hursey 
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users" 
> >> Date: Wednesday, June 17, 2009, 1:42 AM
> >> Did you try checkpointing a non-MPI
> >> application with BLCR on the
> >> cluster? If that does not work then I would
> suspect that
> >> BLCR is not
> >> working properly on the system.
> >>
> >> However if a non-MPI application can be
> checkpointed and
> >> restarted
> >> correctly on this machine then it may be something
> odd with
> >> the Open
> >> MPI installation or runtime environment. To help
> debug here
> >> I woul

Re: [OMPI users] vfs_write returned -14

2009-06-18 Thread Kritiraj Sajadah

Hello Josh,
   Thank you again for your response. I tried checkpointing a simple C 
program using BLCR and got the same error, i.e.:

- vfs_write returned -14
- file_header: write returned -14
Checkpoint failed: Bad address


This is how I installed and ran MPI programs for checkpointing:

1) configure and install blcr

tar zxf blcr-.tar.gz
cd blcr-
mkdir builddir
cd builddir

../configure --prefix=/usr/local/ --enable-debug=yes --enable-libcr-tracing=yes 
--enable-kernel-tracing=yes --enable-testsuite=yes --enable-all-static=yes 
--enable-static=yes

make
make install

2) configure and install openmpi

./configure --prefix=/usr/local/ --enable-picky --enable-debug 
--enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace 
--enable-binaries --enable-trace --enable-static=yes --enable-debug 
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr 
--enable-ft-thread --with-blcr=/usr/local/ --with-blcr-libdir=/usr/local/lib 
--enable-mpi-threads=yes

make all install

3)  Compile and run mpi program as follows:

 raj> mpicc helloworld.c -o helloworld
 raj> mpirun -am ft-enable-cr helloworld

4) To checkpoint the running program,

 raj>  ompi-checkpoint [any option] pid 
 for example:   ompi-checkpoint -v 11527

5) To restart your checkpoint, locate the checkpoint file and type the 
following from the command line:

  raj> ompi-restart ompi_global_snapshot_.ckpt


Then I did another test with BLCR, however.

I tried checkpointing my C application from the /tmp directory instead of my 
$HOME directory and it checkpointed fine.

So, it looks like the problem is with my $HOME directory.

I have "drwx" rights on my $HOME directory, which seems fine to me.

Then I tried it with Open MPI. However, with Open MPI the checkpoint file 
automatically gets saved in the $HOME directory. 

Is there a way to have the file saved in a different location? I checked that 
LAM/MPI has some command line options:

$ mpirun -np 2 -ssi cr_base_dir /somewhere/else a.out

Do we have a similar option for Open MPI?

Thanks a lot

regards,

Raj
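
Because the same checkpoint works under /tmp and fails under $HOME, it may be
worth comparing the filesystem types of the two directories; elsewhere in this
thread Josh suggests looking at how NFS is set up. The sketch below assumes
the stat utility from GNU coreutils (its -f and -c %T options) on Linux.
fs_type_of and report_fs_types are hypothetical helper names.

```shell
# fs_type_of DIR: print the type of the filesystem DIR lives on
# (relies on GNU stat; other stat implementations use different options).
fs_type_of() {
    stat -f -c %T "$1"
}

# report_fs_types DIR...: print one "DIR: TYPE" line per directory.
report_fs_types() {
    for dir in "$@"; do
        printf '%s: %s\n' "$dir" "$(fs_type_of "$dir")"
    done
}
```

Running report_fs_types "$HOME" /tmp prints one line per directory. A network
type such as nfs for $HOME next to a local type for /tmp would be consistent
with the symptoms, though it would not by itself prove the cause.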

--- On Wed, 6/17/09, Josh Hursey  wrote:

> From: Josh Hursey 
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users" 
> Date: Wednesday, June 17, 2009, 1:42 AM
> Did you try checkpointing a non-MPI
> application with BLCR on the  
> cluster? If that does not work then I would suspect that
> BLCR is not  
> working properly on the system.
> 
> However if a non-MPI application can be checkpointed and
> restarted  
> correctly on this machine then it may be something odd with
> the Open  
> MPI installation or runtime environment. To help debug here
> I would  
> need to know how Open MPI was configured and how the
> application was  
> run on the machine (command line arguments, environment
> variables, ...).
> 
> I should note that for the program that you sent it is
> important that  
> you compile Open MPI with the Fault Tolerance Thread
> enabled to ensure  
> a timely checkpoint. Otherwise the checkpoint will be
> delayed until  
> the MPI program enters the MPI_Finalize function.
> 
> Let me know what you find out.
> 
> Josh
> 
> On Jun 16, 2009, at 5:08 PM, Kritiraj Sajadah wrote:
> 
> >
> > Hi Josh,
> >
> > Thanks for the email. I have installed BLCR 0.8.1 and openmpi 1.3 on
> > my laptop with Ubuntu 8.04 on it. It works fine.
> >
> > I have now tried the installation on the cluster (on one machine for
> > now) at my university. The administrator installed it; I am not
> > sure if he followed the steps I gave him.
> >
> > I am checkpointing a simple MPI application which looks as follows:
> >
> > /* The header names were stripped from the archive; these are the
> >    ones the code below needs. */
> > #include <mpi.h>     /* MPI calls */
> > #include <stdio.h>   /* printf */
> > #include <stdlib.h>  /* system */
> >
> > int main(int argc, char **argv)
> > {
> >     int rank, size;
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("I am processor no %d of a total of %d procs \n", rank, size);
> >     system("sleep 30");
> >     printf("bye \n");
> >     MPI_Finalize();
> >     return 0;
> > }
> >
> > Do you think it's better to reinstall BLCR?
> >
> >
> > Thanks
> >
> > Raj
> > --- On Tue, 6/16/09, Josh Hursey wrote:
> >
> >> From: Josh Hursey 
> >> Subject: Re: [OMPI users] vfs_write returned -14
> >> To: "Open MPI Users" 

Re: [OMPI users] vfs_write returned -14

2009-06-16 Thread Kritiraj Sajadah

Hi Josh,

Thanks for the email. I have installed BLCR 0.8.1 and openmpi 1.3 on my laptop 
with Ubuntu 8.04 on it. It works fine.

I have now tried the installation on the cluster (on one machine for now) at my 
university. The administrator installed it; I am not sure if he followed the 
steps I gave him.

I am checkpointing a simple mpi application which looks as follows:

/* The header names were stripped from the archive; these are the ones the
   code below needs. */
#include <mpi.h>     /* MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize */
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* system */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("I am processor no %d of a total of %d procs \n", rank, size);
    system("sleep 30");
    printf("bye \n");
    MPI_Finalize();
    return 0;
}

Do you think it's better to reinstall BLCR?


Thanks 

Raj
--- On Tue, 6/16/09, Josh Hursey  wrote:

> From: Josh Hursey 
> Subject: Re: [OMPI users] vfs_write returned -14
> To: "Open MPI Users" 
> Date: Tuesday, June 16, 2009, 6:42 PM
> 
> These are errors from BLCR. It may be a problem with your
> BLCR installation and/or your application. Are you able to
> checkpoint/restart a non-MPI application with BLCR on these
> machines?
> 
> What kind of MPI application are you trying to checkpoint?
> Some of the MPI interfaces are not fully supported at the
> moment (outlined in the FT User Document that I mentioned in
> a previous email).
> 
> -- Josh
> 
> On Jun 16, 2009, at 11:30 AM, Kritiraj Sajadah wrote:
> 
> > 
> > Dear All,
> >          I have installed openmpi 1.3 and blcr 0.8.1 on a Linux
> > machine (Ubuntu). However, when I try checkpointing an MPI
> > application, I get the following error:
> > 
> > - vfs_write returned -14
> > - file_header: write returned -14
> > 
> > Can someone help, please?
> > 
> > Regards,
> > 
> > Raj
> > 
> > 
> > 
> > 
> > 
> 
> 






[OMPI users] vfs_write returned -14

2009-06-16 Thread Kritiraj Sajadah

Dear All,
  I have installed openmpi 1.3 and blcr 0.8.1 on a Linux machine 
(Ubuntu). However, when I try checkpointing an MPI application, I get the 
following error:

- vfs_write returned -14
- file_header: write returned -14

Can someone help, please?

Regards,

Raj







[OMPI users] Segmentation fault (11)

2009-06-15 Thread Kritiraj Sajadah

Dear All,
 I have installed BLCR 0.8.1 and OPENMPI 1.3 on a Linux platform. 
However, when I tried checkpointing an application, it hung forever just before 
ending.

A checkpoint file is generated. However, when I try restarting from it, I get 
the following error: 

raj@sun06:~$ ompi-restart ompi_global_snapshot_22390.ckpt
[sun06:22423] *** Process received signal ***
[sun06:22423] Signal: Segmentation fault (11)
[sun06:22423] Signal code: Address not mapped (1)
[sun06:22423] Failing at address: (nil)
[sun06:22423] [ 0] [0xb7fb640c]
[sun06:22423] [ 1] 
/usr/local/openmpi/lib/libopen-pal.so.0(opal_crs_blcr_restart+0x103) 
[0xb7f76925]
[sun06:22423] [ 2] opal-restart [0x8049435]
[sun06:22423] [ 3] /lib/libc.so.6(__libc_start_main+0xe5) [0xb7d9a455]
[sun06:22423] [ 4] opal-restart [0x8049001]
[sun06:22423] *** End of error message ***
--
mpirun noticed that process rank 0 with PID 22423 on node sun06 exited on 
signal 11 (Segmentation fault).
--

Any help would be much appreciated.

kind regards,

Raj





[OMPI users] Compiling and Building OPENMPI for checkpointing using self

2009-06-06 Thread Kritiraj Sajadah

Hi All,
   I have successfully installed and configured openmpi to perform 
checkpointing using the BLCR mechanism. However, I now want to try 
checkpointing using SELF.

Has anyone done that? If so, I would very much appreciate it if one of you 
could send me the steps necessary to enable SELF checkpointing.

Many thanks.

Raj






Re: [OMPI users] *** An error occurred in MPI_Init

2009-05-08 Thread Kritiraj Sajadah

Hi Gus,

 Thanks for your email. I have /usr/local/bin included in my $PATH. 
(Not /usr/local/include - it was just a copying mistake).

I checked where mpicc and mpirun are and I got the following paths: 

/usr/local/bin/mpirun
/usr/local/bin/mpicc

The BLCR I am using was downloaded and installed separately.

1) Do you think I may be using the wrong version of BLCR?  
There is a directory called blcr within the openmpi tarball 
(openmpi-1.3/opal/mca/crs/blcr). Should I use this?

2) Do you think it's better to install openmpi in /usr/local/openmpi and blcr 
in /usr/local/blcr?

3) If so, how do I uninstall the one I have already?

Thank you

Kritiraj 
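
Gus's advice quoted below (check which mpicc and mpirun the shell actually
finds) can be scripted. The sketch assumes a POSIX shell; bin_dir_of and
same_install are hypothetical helper names.

```shell
# bin_dir_of CMD: print the directory of the CMD that the shell would run.
bin_dir_of() {
    path=$(command -v "$1") || return 1
    dirname "$path"
}

# same_install CMD1 CMD2: succeed only if both commands resolve to the same
# directory, e.g. both mpicc and mpirun in /usr/local/bin.
same_install() {
    a=$(bin_dir_of "$1") || return 1
    b=$(bin_dir_of "$2") || return 1
    [ "$a" = "$b" ]
}
```

For example, same_install mpicc mpirun || echo "mpicc and mpirun differ" flags
a mixed installation. In ~/.bashrc, writing export PATH=... and
export LD_LIBRARY_PATH=... ensures the values reach child processes such as
mpirun; a plain assignment only does so when the variable is already exported.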



--- On Fri, 5/8/09, Gus Correa  wrote:

> From: Gus Correa 
> Subject: Re: [OMPI users] *** An error occurred in MPI_Init
> To: "Open MPI Users" 
> Date: Friday, May 8, 2009, 6:33 PM
> PS - Kritiraj
> 
> Reading your message more carefully, I saw that you did
> this:
> 
> 
> Open the $HOME/.bashrc and added the following:
> 
> PATH="/usr/local/include:$PATH"
> LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
> 
> 
> 
> However, this is what you should have done:
> 
> 
> Open the $HOME/.bashrc and added the following:
> 
> PATH="/usr/local/bin:$PATH"
> LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
> 
> 
> 
> Note that /usr/local/bin, not /usr/local/include should be
> pre-pended to your PATH!
> 
> 
> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
> 
> 
> Gus Correa wrote:
> > Hi Kritiraj
> > 
> > This looks like many other errors reported on this list
> > that are caused by using the wrong MPI compiler wrappers
> > or the wrong mpirun/mpiexec.
> > Typically this is caused by a PATH environment variable that
> > is pointing to the wrong executables (mpicc, mpirun).
> > Most Linux distributions, compilers, etc, come with their
> > own MPI versions, and this can be very confusing.
> > 
> > Try using full path names for mpicc and for mpirun.
> > That is a bulletproof method to get exactly what you want.
> > In your case use /usr/local/bin (as you configured
> > with --prefix=/usr/local).
> > (Actually, I prefer to configure with a more distinctive
> > name for the prefix, something like /usr/local/openmpi-1.3.2,
> > to avoid any confusion with other MPIs.)
> > 
> > You can also try "which mpicc" and "which mpirun",
> > or "mpicc --showme" and "mpirun --help" to get a bit more
> > information about what you are really using.
> > 
> > I hope this helps.
> > Gus Correa
> > -
> > Gustavo Correa
> > Lamont-Doherty Earth Observatory - Columbia University
> > Palisades, NY, 10964-8000 - USA
> > -
> > 
> > 
> > Kritiraj Sajadah wrote:
> >> Dear All,
> >>           I have installed and configured openmpi with BLCR on my laptop:
> >> 
> >> 1) configure and install blcr
> >> 
> >> ./configure --prefix=/usr/local/ --enable-debug=yes
> >> --enable-libcr-tracing=yes --enable-kernel-tracing=yes
> >> --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
> >> 
> >> make
> >> make install
> >> 
> >> 2) configure and install openmpi
> >> 
> >> ./configure --prefix=/usr/local/ --enable-picky
> >> --enable-debug --enable-mpi-profile --enable-mpi-cxx
> >> --enable-pretty-print-stacktrace --enable-binaries
> >> --enable-trace --enable-static=yes --enable-debug
> >> --with-devel-headers=1 --with-mpi-param-check=always
> >> --with-ft=cr --enable-ft-thread --with-blcr=/usr/local/
> >> --with-blcr-libdir=/usr/local/lib --enable-mpi-threads=yes
> >> 
> >> make all install
> >> 
> >> 3) add the environment variables.
> >> 
> >> 
> >> Opened $HOME/.bashrc and added the following:
> >> 
> >> PATH="/usr/local/include:$PATH"
> >> LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"
> >> 
> >> Now the problem:
> >> 
> >> I am trying to checkpoint the following MPI application:
> >> 
> >> #include <stdio.h>
> >> #include <mpi.h>
> >> 
> >> main(int argc,

[OMPI users] *** An error occurred in MPI_Init

2009-05-08 Thread Kritiraj Sajadah

Dear All,
  I have installed and configured openmpi with BLCR on my laptop:

1) configure and install blcr

./configure --prefix=/usr/local/ --enable-debug=yes --enable-libcr-tracing=yes \
  --enable-kernel-tracing=yes --enable-testsuite=yes --enable-all-static=yes \
  --enable-static=yes

make
make install

2) configure and install openmpi

./configure --prefix=/usr/local/ --enable-picky --enable-debug \
  --enable-mpi-profile --enable-mpi-cxx --enable-pretty-print-stacktrace \
  --enable-binaries --enable-trace --enable-static=yes --enable-debug \
  --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr \
  --enable-ft-thread --with-blcr=/usr/local/ --with-blcr-libdir=/usr/local/lib \
  --enable-mpi-threads=yes

make all install

3) add the environment variables.


Opened $HOME/.bashrc and added the following:

PATH="/usr/local/include:$PATH"
LD_LIBRARY_PATH="/usr/local/lib:$LD_LIBRARY_PATH"

Now the problem:

I am trying to checkpoint the following MPI application:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   int node;   /* rank of this process in MPI_COMM_WORLD */

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &node);

   printf("Hello World from Node %d\n", node);

   MPI_Finalize();
   return 0;
}

I am running mpirun as follows:

raj-laptop> mpirun -am ft-enable-cr helloworld.

The errors are as follows:

--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_cr_init() failed failed
  --> Returned value -1 instead of OPAL_SUCCESS
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[raj-laptop:9439] Abort before MPI_INIT completed successfully; not able to 
guarantee that all other processes were killed!
[raj-laptop:09439] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file 
runtime/orte_init.c at line 77
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--

Is it something to do with running it on a single node, i.e. my laptop? Or is 
it something to do with the configuration or the libraries?


Any help would be much appreciated.

Regards,

Raj
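
Step 3 above puts /usr/local/include on PATH, but with --prefix=/usr/local the executables (mpicc, mpirun, ompi-checkpoint) are installed under /usr/local/bin. The lines below are a minimal sketch of a corrected .bashrc fragment under that assumption. The helper name prepend_dir and the PREFIX variable are made up for illustration and are not part of Open MPI or BLCR. Note the explicit export: a plain assignment to LD_LIBRARY_PATH is not passed on to child processes unless the variable was already exported.

```shell
#!/bin/sh
# prepend_dir LIST DIR: print the colon-separated LIST with DIR put first.
# Hypothetical helper; it avoids a stray trailing colon when LIST is empty,
# because an empty entry in a search path can be read as the current directory.
prepend_dir() {
    printf '%s\n' "$2${1:+:$1}"
}

# Assumption: both packages were configured with --prefix=/usr/local.
PREFIX=/usr/local

# Executables live in PREFIX/bin (not PREFIX/include); libraries in PREFIX/lib.
PATH=$(prepend_dir "$PATH" "$PREFIX/bin")
LD_LIBRARY_PATH=$(prepend_dir "${LD_LIBRARY_PATH:-}" "$PREFIX/lib")
export PATH LD_LIBRARY_PATH

# Report which wrappers the shell now resolves, one tool at a time.
for tool in mpicc mpirun; do
    command -v "$tool" || echo "$tool: not found on PATH" >&2
done
```

If the printed paths do not start with /usr/local/bin, a different MPI installation is probably being picked up first.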






[OMPI users] error while loading shared libraries: libcr.so.0: cannot open shared object file: No such file or directory.

2009-05-04 Thread Kritiraj Sajadah

Dear All,
I have installed openmpi and blcr on my laptop and am trying to 
checkpoint an MPI application.

Both openmpi and blcr are installed in /usr/local.

When I try to checkpoint an MPI application, I get the following error:

error while loading shared libraries: libcr.so.0: cannot open shared object 
file: No such file or directory.

Any help would be very much appreciated.

Regards,

Raj
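
An error of this kind generally means the dynamic linker could not locate libcr.so.0 at run time. One quick thing to rule out is whether any directory listed in LD_LIBRARY_PATH actually contains the file. The sketch below assumes BLCR was installed with --prefix=/usr/local, so the library would be expected in /usr/local/lib. The function name find_lib is hypothetical. LD_LIBRARY_PATH is only one of the places the linker looks (the ldconfig cache is another), so an empty result here is a hint, not a diagnosis.

```shell
#!/bin/sh
# find_lib NAME LIST: print every directory in the colon-separated LIST
# that contains a file called NAME. Hypothetical helper for illustration.
find_lib() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $2; do
        # Skip empty entries, then test for the file itself.
        [ -n "$dir" ] && [ -e "$dir/$name" ] && printf '%s\n' "$dir"
    done
    IFS=$old_ifs
}

hits=$(find_lib libcr.so.0 "${LD_LIBRARY_PATH:-}")
if [ -n "$hits" ]; then
    echo "libcr.so.0 found in: $hits"
else
    echo "libcr.so.0 is not in any LD_LIBRARY_PATH directory" >&2
fi
```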





Re: [OMPI users] mca: base: component_find: unable to open/usr/local/lib/openmpi/mca_crs_blcr: file not found (ignored)

2009-05-04 Thread Kritiraj Sajadah

Hi Jeff,
  In fact I am testing it on my laptop before installing it on the 
cluster. 

I downloaded BLCR and installed it in /usr/local on my laptop.

Then I installed openmpi using the following options:

 ./configure --prefix=/usr/local --with-ft=cr --enable-ft-thread 
--enable-mpi-threads --with-blcr=/usr/local/lib

So, everything is installed and tested on my laptop for now, but I am still 
getting the error.

Please help.

Thanks 

Raj



--- On Mon, 5/4/09, Jeff Squyres  wrote:

> From: Jeff Squyres 
> Subject: Re: [OMPI users] mca: base: component_find: unable to 
> open/usr/local/lib/openmpi/mca_crs_blcr: file not found (ignored)
> To: "Open MPI Users" 
> Date: Monday, May 4, 2009, 2:09 PM
> On May 4, 2009, at 9:06 AM, Kritiraj
> Sajadah wrote:
> 
> > raj@raj:mpirun -np 1 -am ft-enable-cr mpisleep
> > 
> > I got the following with no checkpointing performed:
> > raj@raj:mca: base: component_find: unable to open
> /usr/local/lib/openmpi/mca_crs_blcr: file not found
> (ignored)
> > 
> 
> This is usually a faulty error message from libltdl. 
> It usually means that the dependent libraries for a
> component cannot be found -- e.g., is blcr installed on
> every node where you're trying to use it?
> 
> --Jeff Squyres
> Cisco Systems
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 






[OMPI users] mca: base: component_find: unable to open /usr/local/lib/openmpi/mca_crs_blcr: file not found (ignored)

2009-05-04 Thread Kritiraj Sajadah

Dear All,
  Thanks to Josh and Yaakoub, I was able to configure my openmpi as 
follows:

raj@raj:./configure --prefix=/usr/local --with-ft=cr --enable-ft-thread 
--enable-mpi-threads --with-blcr=/usr/local.

raj@raj:make all install

I tried to checkpoint an MPI application using the following command, running 
on a single node:

raj@raj:mpirun -np 1 -am ft-enable-cr mpisleep

I got the following with no checkpointing performed:
raj@raj:mca: base: component_find: unable to open 
/usr/local/lib/openmpi/mca_crs_blcr: file not found (ignored)

Please help.

Regards,

Raj
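
When a component is reported as "file not found", one way to look for unresolved dependencies is to run ldd on the component and list the entries marked "not found". The sketch below assumes a glibc-style ldd output format and assumes the component file is the path from the message with ".so" appended. The function name missing_libs is hypothetical.

```shell
#!/bin/sh
# missing_libs: read `ldd` output on stdin and print the names of the
# dependencies that the dynamic linker reported as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Assumption: path taken from the error message, with ".so" appended.
component=/usr/local/lib/openmpi/mca_crs_blcr.so
if [ -e "$component" ]; then
    ldd "$component" | missing_libs
else
    echo "$component does not exist" >&2
fi
```

If libcr.so.0 shows up in the list, the component itself is present and it is the BLCR library that the linker cannot find.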





[OMPI users] Checkpointing configuration problem

2009-05-01 Thread Kritiraj Sajadah

Dear all, 
I am trying to install openmpi 1.3 on my laptop. I successfully 
installed BLCR in /usr/local.

When installing openmpi using the following options:

 ./configure --prefix=/usr/local --with-ft=cr --enable-ft-thread 
--enable-MPI-thread --with-blcr=/usr/local

I got the following error:


== System-specific tests

...

checking if want fault tolerance thread... Must enable progress or MPI threads 
to use this option
configure: error: Cannot continue

Help please.

regards,

Raj
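
The configure message says --enable-ft-thread needs progress or MPI threads enabled. The command above passes --enable-MPI-thread, while the spelling used elsewhere in this thread is --enable-mpi-threads (lowercase, plural). The sketch below is a hypothetical pre-check that mirrors that rule on an option list before configure is run. The function name ft_thread_ok is made up, and --enable-progress-threads is assumed here to be the option behind the "progress" wording in the message. Option names are compared exactly, on the assumption that configure treats them as case-sensitive.

```shell
#!/bin/sh
# ft_thread_ok OPTION...: succeed unless --enable-ft-thread is requested
# without --enable-mpi-threads or --enable-progress-threads.
# Hypothetical helper; it only inspects the arguments, it does not run configure.
ft_thread_ok() {
    want_ft=no
    have_threads=no
    for opt in "$@"; do
        case $opt in
            --enable-ft-thread) want_ft=yes ;;
            --enable-mpi-threads|--enable-mpi-threads=yes) have_threads=yes ;;
            --enable-progress-threads|--enable-progress-threads=yes) have_threads=yes ;;
        esac
    done
    [ "$want_ft" = no ] || [ "$have_threads" = yes ]
}

# The option list from the message above, checked before running configure.
if ft_thread_ok --prefix=/usr/local --with-ft=cr --enable-ft-thread \
        --enable-MPI-thread --with-blcr=/usr/local; then
    echo "thread options look consistent"
else
    echo "--enable-ft-thread needs --enable-mpi-threads or --enable-progress-threads" >&2
fi
```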





[OMPI users] checkpoint file contains nothing

2008-06-29 Thread Kritiraj Sajadah
HI,
   I have installed openmpi-1.3a1r18651 and tried to checkpoint an MPI 
application. 

raj@portal018:~/examples> mpirun  -np 1 -am ft-enable-cr ./myapp.sh &

raj@portal018:~/examples> ompi-checkpoint --term 30416


However, when I try to restart from the checkpoint file, I get the following message. 


raj@portal018:~> ompi-restart -v -machinefile portal018 
ompi_global_snapshot_30416.ckpt
[portal018:20178] Checking for the existence of 
(/home/raj/ompi_global_snapshot_30416.ckpt)
[portal018:20178] Restarting from file (ompi_global_snapshot_30416.ckpt)
[portal018:20178]Exec in self
--
mpirun could not find anything to do.
It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--


Any help would be much appreciated.

Regards,

Raj