Dear OMPI users,
I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2 on our blades (blade01 to blade10, on NFS).

BLCR configure script:

  ./configure --prefix=/opt/blcr --enable-static

After the installation, I can see the 'blcr' module loaded correctly (lsmod | grep blcr). I can also run cr_run, cr_checkpoint, and cr_restart to checkpoint and restart the examples under /blcr/examples/ correctly.

Then, the OMPI configure script is:

  ./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads --enable-static

The installation went fine too. Then here comes the problem.

On one node, both of these work:

  mpirun -np 2 ./hello_c
  mpirun -np 2 -am ft-enable-cr ./hello_c

On two nodes (blade01, blade02):

  mpirun -np 2 -machinefile mf ./hello_c                      OK
  mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c     ERROR

The error output is listed below:

*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[blade02:28896] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during opal_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  opal_cr_init() failed failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
[blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 77
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------

I have no idea what is causing this error. Our blades use NFS; does that matter? Can anyone help me solve the problem? I would really appreciate it.

Thank you.

BTW, a similar error occurs when I use LAM/MPI + BLCR:

"Oops, cr_init() failed (the initialization call to the BLCR checkpointing system). Abort in despair.
The crmpi SSI subsystem failed to initialized modules successfully during MPI_INIT. This is a fatal error; I must abort."

Regards,
whchen