[OMPI users] Checkpointing fails with BLCR 0.8.0b2

2008-12-04 Thread Matthias Hovestadt
Hi! Berkeley recently released a new version of BLCR. They had already marked the function cr_request_file as deprecated in BLCR 0.7.3, and have now removed the deprecated functions from the libcr API. Since the checkpointing support in OMPI uses cr_request_file, all checkpointing operations fail with

Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Matthias Hovestadt
Hi Tim! First of all: thanks a lot for answering! :-) Could you try running your two MPI jobs with fewer procs each, say 2 or 3 each instead of 4, so that there are a few extra cores available. This problem occurs with any number of procs. Also, what happens to the checkpointing of one

[OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Matthias Hovestadt
Hi! I'm using the development version of OMPI from SVN (rev. 19857) for executing MPI jobs on my cluster system. In particular, I'm using the checkpoint and restart feature, based on the most recent version of BLCR. Checkpointing works fine as long as I only execute a single job

Re: [OMPI users] Checkpointing a restarted app fails

2008-09-24 Thread Matthias Hovestadt
Hi Josh! I believe this is now fixed in the trunk. I was able to reproduce with the current trunk and committed a fix a few minutes ago in r19601. So the fix should be in tonight's tarball (or you can grab it from SVN). I've made a request to have the patch applied to v1.3, but that may take a

Re: [OMPI users] Checkpointing a restarted app fails

2008-09-18 Thread Matthias Hovestadt
Hi Josh! First of all, thanks a lot for replying. :-) When executing this checkpoint command, the running application directly aborts, even though I did not specify the "--term" option: -- mpirun noticed that process

Re: [OMPI users] Where is ompi-chekpoint?

2008-09-17 Thread Matthias Hovestadt
Hi! Hi, I have installed openmpi-1.2.7 with following instructions: ./configure --with-ft=cr --enable-ft-enable-thread --enable-mpi-thread --with-blcr=$HOME/blcr --prefix=$HOME/openmpi make all install In directory bin of directory $HOME/openmpi there is not ompi-checkpoint and ompi-restart.

[OMPI users] Checkpointing a restarted app fails

2008-09-17 Thread Matthias Hovestadt
Hi! Since I am interested in fault tolerance, checkpointing and restart of OMPI is an interesting feature for me. So I installed BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI I followed the instructions in the "Fault Tolerance Guide" in the OMPI wiki: ./autogen.sh ./configure

Re: [OMPI users] Checkpoint problem

2008-08-20 Thread Matthias Hovestadt
Hi Gabriele! In this case, mpirun works well, but the checkpoint procedure fails: ompi-checkpoint 20109 [node0316:20134] Error: Unable to get the current working directory [node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 395 [node0316:20134] HNP with