Dear OMPI users,

 

I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2 (blade01 - blade10,
NFS).

BLCR configure script: ./configure --prefix=/opt/blcr --enable-static

After the installation, I can see the ‘blcr’ module loaded correctly
(lsmod | grep blcr). And I can also run ‘cr_run’, ‘cr_checkpoint’,
‘cr_restart’ to C/R the examples correctly under /blcr/examples/.

Then, the OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr
--with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads --enable-static

The installation is okay too.

 

Then here comes the problem.

On one node:

         mpirun -np 2 ./hello_c.c

         mpirun -np 2 -am ft-enable-cr ./hello_c.c

         are both okay.

On two nodes (blade01, blade02):

         mpirun -np 2 -machinefile mf ./hello_c.c  OK.

         mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c  ERROR, listed below:

 

*** An error occurred in MPI_Init 
*** before MPI was initialized 
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) 
[blade02:28896] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed! 
-------------------------------------------------------------------------- 
It looks like opal_init failed for some reason; your parallel process is 
likely to abort. There are many reasons that a parallel process can 
fail during opal_init; some of which are due to configuration or 
environment problems. This failure appears to be an internal failure; 
here's some additional information (which may only be relevant to an 
Open MPI developer): 

  opal_cr_init() failed failed 
  --> Returned value -1 instead of OPAL_SUCCESS 
-------------------------------------------------------------------------- 
[blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 77 
-------------------------------------------------------------------------- 
It looks like MPI_INIT failed for some reason; your parallel process is 
likely to abort. There are many reasons that a parallel process can 
fail during MPI_INIT; some of which are due to configuration or environment 
problems. This failure appears to be an internal failure; here's some 
additional information (which may only be relevant to an Open MPI 
developer): 

  ompi_mpi_init: orte_init failed 
  --> Returned "Error" (-1) instead of "Success" (0) 
-------------------------------------------------------------------------- 

 

I have no idea what causes the error. Our blades use NFS; does that matter?
Can anyone help me solve the problem? I would really appreciate it. Thank you.
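
One thing that may be worth ruling out, since the abort is reported by
blade02: whether the blcr module is loaded on every node listed in the
machinefile, and not only on the node where the lsmod check was run. The
following is only a sketch under assumptions: it assumes passwordless ssh to
each blade, that lsmod is on the remote PATH, and that the first field of each
machinefile line is the host name. The function name check_blcr_nodes is
hypothetical, and a missing module is only one possible cause of the error.

```shell
# check_blcr_nodes MACHINEFILE
# Hypothetical helper: repeats the "lsmod | grep blcr" check on every host
# named in the machinefile. Assumes passwordless ssh and lsmod on the remote PATH.
check_blcr_nodes() {
    machinefile=$1
    # Skip blank lines and comments; keep the first field (the host name).
    awk '!/^[ \t]*(#|$)/ { print $1 }' "$machinefile" | sort -u |
    while read -r host; do
        # -n keeps ssh from swallowing the host list on stdin.
        if ssh -n "$host" 'lsmod | grep -q blcr'; then
            echo "$host: blcr loaded"
        else
            echo "$host: blcr NOT loaded"
        fi
    done
}
```

With the machinefile used above, this would be called as
"check_blcr_nodes mf". A "NOT loaded" line is also printed when ssh itself
fails, so such a result would need a manual check on that node.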

 

btw, a similar error occurs when I use LAM/MPI + BLCR:

“Oops, cr_init() failed (the initialization call to the BLCR checkpointing
system). Abort in despair.

The crmpi SSI subsystem failed to initialized modules successfully during
MPI_INIT. This is a fatal error; I must abort.”

 

Regards

 

whchen

 
