Hi!
Berkeley recently released a new version of BLCR. They had already
marked the function cr_request_file as deprecated in BLCR 0.7.3. Now
they have removed the deprecated functions from the libcr API.
Since the checkpointing support in OMPI uses cr_request_file, all
checkpointing operations fail with
Hi Tim!
First of all: thanks a lot for answering! :-)
Could you try running your two MPI jobs with fewer procs each,
say 2 or 3 each instead of 4, so that there are a few extra cores available.
This problem occurs with any number of procs.
Also, what happens to the checkpointing of one
Hi!
I'm using the development version of OMPI from SVN (rev. 19857)
to execute MPI jobs on my cluster system. In particular, I'm using
the checkpoint and restart feature, based on the most recent version
of BLCR.
Checkpointing works fine as long as I only execute
a single job
Hi Josh!
I believe this is now fixed in the trunk. I was able to reproduce
with the current trunk and committed a fix a few minutes ago in
r19601. So the fix should be in tonight's tarball (or you can grab it
from SVN). I've made a request to have the patch applied to v1.3, but
that may take a
Hi Josh!
First of all, thanks a lot for replying. :-)
When executing this checkpoint command, the running application
directly aborts, even though I did not specify the "--term" option:
--
mpirun noticed that process
Hi!
Hi, I have installed openmpi-1.2.7 with the following instructions:
./configure --with-ft=cr --enable-ft-enable-thread --enable-mpi-thread
--with-blcr=$HOME/blcr --prefix=$HOME/openmpi
make all install
The bin directory of $HOME/openmpi contains neither ompi-checkpoint nor
ompi-restart.
Hi!
Since I am interested in fault tolerance, the checkpoint and
restart support in OMPI is an interesting feature for me. So I installed
BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI
I followed the instructions in the "Fault Tolerance Guide"
in the OMPI wiki:
./autogen.sh
./configure
Hi Gabriele!
In this case, mpirun works well, but the checkpoint procedure fails:
ompi-checkpoint 20109
[node0316:20134] Error: Unable to get the current working directory
[node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file
orte-checkpoint.c at line 395
[node0316:20134] HNP with