Re: [OMPI users] openmpi self checkpointing - error while running example

2011-05-24 Thread Faisal
Hi Roman,

Try putting the name of your executable at the end of the path:
char restart_path[128] = "/full/path/to/personal-cr";
Here 'personal-cr' is the executable.
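
A quick way to check that the path really names the executable and not just
its directory can be sketched as a small shell function. The function name
check_restart_path is made up for illustration and is not part of Open MPI;
/full/path/to/personal-cr is the placeholder path from above.

```shell
#!/bin/sh
# Hypothetical helper (not an Open MPI tool): succeed only if the argument
# is an absolute path to an executable regular file.
check_restart_path() {
    case "$1" in
        /*) ;;   # absolute path, keep checking
        *)  echo "not an absolute path: $1" >&2
            return 1 ;;
    esac
    if [ -d "$1" ]; then
        # A directory is executable too, so test for it explicitly.
        echo "path is a directory, append the executable name: $1" >&2
        return 1
    fi
    if [ ! -f "$1" ] || [ ! -x "$1" ]; then
        echo "no executable file at: $1" >&2
        return 1
    fi
    return 0
}

# Example call, using the placeholder path from above:
#   check_restart_path /full/path/to/personal-cr && echo "restart path looks usable"
```

If the check fails, restart_path probably still names a directory or a
relative path rather than the binary itself.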

I hope it helps.

Kind regards,
Faisal




Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Hellmüller Roman
Hi Toan

No, that didn't change anything. I'm trying to restart the program on the same
computer it ran on before, and I run ompi-restart on that computer as well.

machinefile_cbl1 contains just cbl1

hroman@cbl1 ~/checkpoints $ ompi-restart -v -machinefile machinefile_cbl1 ompi_global_snapshot_28952.ckpt/
[cbl1:30308] Checking for the existence of (/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
[cbl1:30308] Restarting from file (ompi_global_snapshot_28952.ckpt/)
[cbl1:30308]  Exec in self
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_0.ckpt). Returned -1.

--
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

cheers
roman



Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Nguyen Toan
Hi Roman,

It seems that you misunderstand the "-machinefile" parameter.
It should be followed by a file containing a list of the machines
your MPI application will run on. For example, if you want to
run your app on 2 nodes named "node1" and "node2", then this file, let's call
it "MACHINES_FILE", should look like this:

node1
node2

Now try to checkpoint and restart again with "-machinefile MACHINES_FILE".
Hope it works.
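
As a rough sketch, the file can be written and sanity-checked from the shell
before it is handed to "-machinefile". The function looks_like_machinefile is
a made-up helper, not an Open MPI command, and its check is only a heuristic:
it rejects a file whose first field on some line is a bare number. That would
be consistent with the "ssh: connect to host 15" error reported earlier in the
thread, where a checkpoint file was passed instead of a host list and part of
it was presumably read as a hostname.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if the file is readable, has at least one
# entry, and no entry starts with a bare integer such as "15".
looks_like_machinefile() {
    [ -r "$1" ] || return 1
    awk '
        /^[ \t]*#/    { next }      # skip comment lines
        NF == 0       { next }      # skip blank lines
                      { n++ }       # count host entries
        $1 ~ /^[0-9]+$/ { bad = 1 } # a bare number is not a plausible host
        END { exit (bad || n == 0) ? 1 : 0 }
    ' "$1"
}

# One hostname per line; node1 and node2 are the example names from above.
printf '%s\n' node1 node2 > MACHINES_FILE
looks_like_machinefile MACHINES_FILE && echo "MACHINES_FILE looks plausible"
```

The same file is then passed as "-machinefile MACHINES_FILE" both when
starting the job and when restarting it.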


Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Hellmüller Roman
Hi Toan

Thanks for your suggestion. It gives me the following result, which does not
tell me anything more.

hroman@cbl1 ~/checkpoints $ ompi-restart -v -machinefile ../semesterthesis/code/code2_self_example/my-hroman-cr-file.ckpt ompi_global_snapshot_28952.ckpt/
[cbl1:28974] Checking for the existence of (/home/hroman/checkpoints/ompi_global_snapshot_28952.ckpt)
[cbl1:28974] Restarting from file (ompi_global_snapshot_28952.ckpt/)
[cbl1:28974]  Exec in self
ssh: connect to host 15 port 22: Invalid argument
--
A daemon (pid 28975) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
hroman@cbl1 ~/checkpoints $ echo $LD_LIBRARY_PATH
/cluster/sw/blcr/0.8.2/x86_64/gcc//lib:/cluster/sw/openmpi/1.5.3_ft/x86_64/gcc/lib:/opt/intel/Compiler/11.1/056/lib/intel64

The library path seems to be OK, or should it look different? Do you have
another idea?
cheers
roman


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Nguyen Toan
Hi Roman,

Did you try to checkpoint and restart with the "-machinefile" parameter? It
may work.

Regards,
Nguyen Toan



[OMPI users] openmpi self checkpointing - error while running example

2011-04-06 Thread Hellmüller Roman
Hi

I'm trying to get fault-tolerant Open MPI running on our cluster for my
semester thesis.

Build and compile were successful, and BLCR checkpointing works (Open MPI
1.5.3, BLCR 0.8.2).

Now I'm trying to set up SELF checkpointing. The example from
http://osl.iu.edu/research/ft/ompi-cr/examples.php does not work. I can run the
application and also take checkpoints, but restarting won't work. I got the
following error by doing as suggested:

mpicc my-app.c -export -export-dynamic -o my-app

mpirun -np 2 -am ft-enable-cr -mca crs_self_prefix my_personal my-app

hroman@cbl1 ~ $ ompi-restart ompi_global_snapshot_27167.ckpt/
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_0.ckpt). Returned -1.

--
--
Error: Unable to obtain the proper restart command to restart from the
   checkpoint file (opal_snapshot_1.ckpt). Returned -1.

--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

I also tried setting the path in the example file (the restart_path
variable), changing the checkpoint directories, and running the application in
different directories...

Do you have an idea where the error could be?

Here
http://n.ethz.ch/~hroman/downloads/ompi_mailinglist.tar.gz
(40MB) you'll find the library and the builds of Open MPI and BLCR, as well as
the environment variables and the output of ompi_info. There is one set for the
login node and another for the compute nodes, because they run different
kernels. And here
http://n.ethz.ch/~hroman/downloads/ompi_global_snapshot_27167.ckpt.tar.gz
is the checkpoint that was produced. Please let me know if more output is
needed.

cheers
roman