[OMPI users] OpenMPI with BLCR runtime problem

2010-08-24 Thread
Dear OMPI users,

 

I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2 (blade01–blade10, NFS).

BLCR configure script: ./configure --prefix=/opt/blcr --enable-static

After the installation, I can see the ‘blcr’ module loaded correctly
(lsmod | grep blcr). And I can also run ‘cr_run’, ‘cr_checkpoint’,
‘cr_restart’ to C/R the examples correctly under /blcr/examples/.
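For anyone reproducing this setup, a per-node BLCR sanity check might look like the following sketch (the /opt/blcr prefix is the one used above; 'my_app' is a placeholder program name, not from the post):

```shell
# Sketch: verify BLCR on a single node before involving MPI.
# Assumes the /opt/blcr prefix above; 'my_app' is an illustrative program.
export PATH=/opt/blcr/bin:$PATH
lsmod | grep blcr            # both blcr_imports and blcr should be listed
cr_run ./my_app &            # run the program under BLCR control
pid=$!
cr_checkpoint --term $pid    # write context.<pid> and terminate the process
cr_restart context.$pid      # resume the program from its checkpoint file
```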

Then, the OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr
--with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads --enable-static

The installation is okay too.

 

Then here comes the problem.

On one node:

 mpirun -np 2 ./hello_c.c

 mpirun -np 2 -am ft-enable-cr ./hello_c.c

 are both okay.

On two nodes (blade01, blade02):

 mpirun -np 2 -machinefile mf ./hello_c.c  OK.

mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c  ERROR. Listed
below:

 

*** An error occurred in MPI_Init 
*** before MPI was initialized 
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) 
[blade02:28896] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed! 
-- 
It looks like opal_init failed for some reason; your parallel process is 
likely to abort. There are many reasons that a parallel process can 
fail during opal_init; some of which are due to configuration or 
environment problems. This failure appears to be an internal failure; 
here's some additional information (which may only be relevant to an 
Open MPI developer): 

  opal_cr_init() failed failed 
  --> Returned value -1 instead of OPAL_SUCCESS 
-- 
[blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 77 
-- 
It looks like MPI_INIT failed for some reason; your parallel process is 
likely to abort. There are many reasons that a parallel process can 
fail during MPI_INIT; some of which are due to configuration or environment 
problems. This failure appears to be an internal failure; here's some 
additional information (which may only be relevant to an Open MPI 
developer): 

  ompi_mpi_init: orte_init failed 
  --> Returned "Error" (-1) instead of "Success" (0) 
-- 

 

I have no idea what causes this error. Our blades use NFS; does that matter? Can
anyone help me solve the problem? I would really appreciate it. Thank you.

 

By the way, a similar error occurs when I use LAM/MPI + BLCR:

“Oops, cr_init() failed (the initialization call to the BLCR checkpointing
system). Abort in despair.

The crmpi SSI subsystem failed to initialize modules successfully during
MPI_INIT. This is a fatal error; I must abort.”

 

Regards

 

whchen

 



Re: [OMPI users] OpenMPI with BLCR runtime problem

2010-08-25 Thread
Thank you very much for your advice, Josh. As you suggested, when I check 'lsmod |
grep blcr' on blade02, nothing shows up: no blcr module is loaded on blade02. I
think that is the main reason why I can't C/R MPI programs across these two nodes.
But here is the problem:
I installed BLCR under /opt/blcr on blade01. Our blades use NFS, and the /opt/
and /home/ directories are shared, so commands like 'cr_run' and 'cr_restart'
can be found on blade02. But I can't insert the blcr module on blade02. It shows:
insmod: error inserting '/opt/blcr/lib/blcr/2.6.16.60-0.21-smp/blcr.ko': -1
Unknown symbol in module
Does this mean that I have to install BLCR on blade02? If so, where should I
install it? Just overwrite /opt/blcr, or somewhere else?
Please give me some advice. Thank you.
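One way to narrow down an insmod "Unknown symbol" failure (a sketch; the paths are the ones from this post) is to confirm that the module directory matches the running kernel and to read the kernel log, which names the unresolved symbols:

```shell
# Sketch: diagnosing the insmod failure on blade02 (paths from this thread).
uname -r                            # must match the 2.6.16.60-0.21-smp build directory
ls /opt/blcr/lib/blcr/$(uname -r)/  # blcr_imports.ko provides symbols blcr.ko needs
dmesg | tail                        # the kernel log names each unresolved symbol
```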


On Aug 24, 2010, at 10:27 AM, whchen wrote:

> [...]
> btw, similar error like:
> ??Oops, cr_init() failed (the initialization call to the BLCR
checkpointing system). Abort in despair.
> The crmpi SSI subsystem failed to initialized modules successfully during
MPI_INIT. This is a fatal error; I must abort.?? occurs when I use LAM/MPI +
BLCR.

This seems to indicate that BLCR is not working correctly on one of the
compute nodes. Did you try some of the BLCR example programs on both of the
compute nodes? If BLCR's cr_init() fails, then there is not much the MPI
library can do for you.

I would check the installation of BLCR on all of the compute nodes (blade01
and blade02). Make sure the modules are loaded and that the BLCR single
process examples work on all nodes. I suspect that one of the nodes is
having trouble initializing the BLCR library.

You may also want to check to make sure prelinking is turned off on all
nodes as well:
  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
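On Red Hat-style systems, checking and disabling prelink might look like this sketch (the config file location and prelink flags are assumptions for that layout; see the FAQ above for your distribution):

```shell
# Sketch: turn prelink off and undo prior prelinking (run as root;
# the /etc/sysconfig/prelink location assumes a Red Hat-style system).
grep PRELINKING /etc/sysconfig/prelink                 # want PRELINKING=no
sed -i 's/PRELINKING=yes/PRELINKING=no/' /etc/sysconfig/prelink
prelink --undo --all                                   # restore un-prelinked binaries
```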

If that doesn't work, then I would suggest trying the current Open MPI trunk.
There should not be any problem with using NFS: since this error occurs in
MPI_Init, it is well before we ever try to use the file system. I also test
with NFS and local staging on a fairly regular basis, so NFS shouldn't be a
problem even when checkpointing/restarting.

-- Josh

>  
> Regards
>  
> whchen
>  
> 


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey



Re: [OMPI users] OpenMPI with BLCR runtime problem

2010-08-25 Thread
I was careless. The BLCR Admin Guide says to load the kernel modules, as
root, in this order:
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_imports.ko
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr.ko
In my last email, I had loaded the modules in the wrong order. When I followed
the order above, it succeeded.
Thank you very much for your advice, Josh. Many thanks.
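For anyone hitting the same thing, loading the two modules in the required order on every node can be scripted, e.g. (a sketch; the hostnames and the /opt/blcr prefix are the ones from this thread, and root SSH access is assumed):

```shell
# Sketch: load blcr_imports.ko before blcr.ko on each node, then verify.
for host in blade01 blade02; do
  ssh root@$host '
    /sbin/insmod /opt/blcr/lib/blcr/$(uname -r)/blcr_imports.ko &&
    /sbin/insmod /opt/blcr/lib/blcr/$(uname -r)/blcr.ko &&
    lsmod | grep blcr'
done
```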


[OMPI users] Checkpoint problem with BLCR + OpenMPI

2010-08-27 Thread
Dear OMPI Users,

 

I have installed BLCR (0.8.2) and OpenMPI (1.4.2) successfully, but now I have
run into a problem when taking a checkpoint.

I run the CG NPB benchmark (NPROCS=16, two nodes: blade02 & blade04, CLASS=C;
NFS: $HOME and /opt are shared).

 

BLCR configure: ./configure --prefix=/opt/blcr --enable-static

OpenMPI configure: ./configure --prefix=/opt/ompi --with-ft=cr
--with-blcr=/opt/blcr --enable-static (I didn't add the 'enable-ft-thread'
option because I think it might affect performance. Is that right? And
MPI threads are enabled by default, so I didn't add the 'enable-mpi-threads'
option.) Can anyone tell me whether these two options will make the checkpoint
time shorter or longer?

Our blades use NFS; $HOME and /opt are shared. The checkpoint file is created
in the $HOME directory by default. Will this cause the long checkpoint time?

 

In $HOME/.openmpi/mca-params.conf:

crs_base_snapshot_dir=/tmp/

snapc_base_global_snapshot_dir=$HOME/ompi-cr-file

snapc_base_store_in_place=0
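For reference, the same three settings can also be passed on the mpirun command line instead of via mca-params.conf, e.g. (a sketch using the job from this post):

```shell
# Sketch: equivalent per-run MCA settings instead of $HOME/.openmpi/mca-params.conf
mpirun -machinefile mf -am ft-enable-cr \
       -mca crs_base_snapshot_dir /tmp/ \
       -mca snapc_base_global_snapshot_dir $HOME/ompi-cr-file \
       -mca snapc_base_store_in_place 0 \
       -np 8 ./cg.C.8
```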

 

Then in mpirun terminal:

mpirun -machinefile mf -am ft-enable-cr -n 8 ./cg.C.8

 

In checkpoint terminal:

ompi-checkpoint --status 11133

[blade02:11171]     Requested - Global Snapshot Reference: (null)
[blade02:11171]       Pending - Global Snapshot Reference: (null)
[blade02:11171]       Running - Global Snapshot Reference: (null)
[blade02:11171] File Transfer - Global Snapshot Reference: (null)

 

In mpirun terminal:

--

WARNING: Could not preload specified file: File already exists.

 

Fileset: $HOME/ompi-cr-file/ompi_global_snapshot_11133.ckpt/0

Host: blade02

 

Will continue attempting to launch the process.

 

--

[blade02:11133] 3 more processes have sent help message
help-orte-filem-rsh.txt / orte-filem-rsh:get-file-exists

[blade02:11133] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
help / error messages

 

How can I disable the ‘preload’ behavior, and how can I solve this problem? Thanks.

 

By the way, when there is no mca-params.conf and the checkpoint file is placed
in the $HOME directory by default, I can checkpoint successfully. BUT it takes
a very, very long time to checkpoint. With no checkpoint, CG runs in about
100 s; with a checkpoint, it runs in 300 s. That is a 200% overhead ratio. WHY?

 

Regards

 

Whchen

 

 



[OMPI users] High Checkpoint Overhead Ratio

2010-08-30 Thread
Dear OMPI Users,

 

I’m now using BLCR-0.8.2 and OpenMPI-1.5rc5. The problem is that it takes a
very long time to checkpoint.

 

BLCR configuration:

./configure --prefix=/opt/blcr --enable-static

OpenMPI configuration:

./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr
--enable-static  --enable-ft-thread --enable-mpi-threads

 

Our blades use NFS. $HOME and /opt are shared.

 

In $HOME/.openmpi/mca-params.conf:

crs_base_snapshot_dir=/tmp/

snapc_base_global_snapshot_dir=/home/chenwh

snapc_base_store_in_place=0

 

 

Now I run CG NPB (NPROCS=16, CLASS=C) on two nodes (blade02, blade04).

With no checkpoint, 'Time in seconds' is about 100s. It's normal.

But when I take a single checkpoint, 'Time in seconds' is up to 300s. The
overhead ratio is over 200%! WHY? How can I improve it?

 

blade02:~> ompi-checkpoint --status 27115

[blade02:27130] [  0.00 /   0.25] Requested - ...

[blade02:27130] [  0.00 /   0.25]   Pending - ...

[blade02:27130] [  0.21 /   0.46]   Running - ...

[blade02:27130] [221.25 / 221.71]  Finished - ompi_global_snapshot_27115.ckpt

Snapshot Ref.:   0 ompi_global_snapshot_27115.ckpt

 

As you can see, it takes 200+ seconds to checkpoint. By the way, what do the
first and second numbers in [ , ] represent?

 

Regards

 

Whchen