[OMPI users] Checkpointing fails with BLCR 0.8.0b2

2008-12-04 Thread Matthias Hovestadt

Hi!

Berkeley recently released a new version of their BLCR. They had already
marked the function cr_request_file() as deprecated in BLCR 0.7.3. Now
they have removed the deprecated functions from the libcr API.

Since the checkpointing support in OMPI uses cr_request_file(), all
checkpointing operations fail with BLCR 0.8.0b2, making a downgrade
to BLCR 0.7.3 necessary.
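For reference, the replacement interface in newer libcr versions is, as far
as I can tell, cr_request_checkpoint(). A minimal sketch of requesting a
checkpoint with it might look like the following. This is only my reading of
the BLCR documentation, not the actual OMPI code, so take the details
(especially the field names of cr_checkpoint_args_t) with a grain of salt:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <libcr.h>

  /* Sketch only: request a checkpoint of the calling process and write
   * the context to "filename". Cleanup and error handling are omitted. */
  int request_checkpoint(const char *filename)
  {
      cr_checkpoint_args_t   args;
      cr_checkpoint_handle_t handle;
      int ret;

      cr_init();  /* initialize libcr before any other call */

      cr_initialize_checkpoint_args_t(&args);
      args.cr_fd     = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0600);
      args.cr_scope  = CR_SCOPE_PROC;   /* checkpoint the whole process */
      args.cr_target = getpid();        /* ...namely ourselves */

      ret = cr_request_checkpoint(&args, &handle);
      if (ret < 0) {
          perror("cr_request_checkpoint");
          return ret;
      }

      /* Block until the checkpoint has completed (NULL = no timeout). */
      return cr_poll_checkpoint(&handle, NULL);
  }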


Best,
Matthias


Re: [OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Matthias Hovestadt

Hi Tim!

First of all: thanks a lot for answering! :-)



> Could you try running your two MPI jobs with fewer procs each, say 2 or
> 3 each instead of 4, so that there are a few extra cores available?


This problem occurs with any number of procs.


> Also, what happens to the checkpointing of one MPI job if you kill the
> other MPI job after the first "hangs"?


Nothing, it keeps hanging.

> (It may not be a true hang, but very very slow progress that you
> are observing.)

I have already waited for more than 12 hours, but ompi-checkpoint did
not return. So if it is making progress, it must be very slow indeed.


I continued testing and just observed a case where the problem
occurred with only one job running on the compute node:

---
ccs@grid-demo-1:~$ ps auxww | grep mpirun | grep -v grep
ccs   7706  0.4  0.2  63864  2640 ?S15:35   0:00 mpirun 
-np 1 -am ft-enable-cr -np 6 
/home/ccs/XN-OMPI/testdrive/loop-1/remotedir/mpi-x-povray +I planet.pov 
-w1600 -h1200 +SP1 +O planet.tga

ccs@grid-demo-1:~$
---

The resource management system tried to checkpoint this job using the
command "ompi-checkpoint -v --term 7706". This is the output of that
command:

---
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08178] PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08178] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: 
Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: notify_hnp: 
Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Requested - Global 
Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:08178] Pending (Termination) - Global 
Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:08178] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:08178]   Running - Global 
Snapshot Reference: (null)

---

If I look at the activity on the node, I see that the processes
are still computing:

---
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
 7710 ccs   25   0  327m 6936 4052 R  102  0.7   4:14.17 mpi-x-povray
 7712 ccs   25   0  327m 6884 4000 R  102  0.7   3:34.06 mpi-x-povray
 7708 ccs   25   0  327m 6896 4012 R   66  0.7   2:42.10 mpi-x-povray
 7707 ccs   25   0  331m  10m 3736 R   54  1.0   3:08.62 mpi-x-povray
 7709 ccs   25   0  327m 6940 4056 R   48  0.7   1:48.24 mpi-x-povray
 7711 ccs   25   0  327m 6724 4032 R   36  0.7   1:29.34 mpi-x-povray
---

Now I killed the hanging ompi-checkpoint operation and tried
to execute a checkpoint manually:

---
ccs@grid-demo-1:~$ ompi-checkpoint -v --term 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:08224] PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] Connected to Mpirun [[3623,0],0]
[grid-demo-1.cit.tu-berlin.de:08224] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: 
Contact Head Node Process PID 7706
[grid-demo-1.cit.tu-berlin.de:08224] orte_checkpoint: notify_hnp: 
Requested a checkpoint of jobid [INVALID]

---

Is there perhaps a way of increasing the level of debug output?
Please let me know if I can support you in any way...


Best,
Matthias


[OMPI users] ompi-checkpoint is hanging

2008-10-31 Thread Matthias Hovestadt

Hi!

I'm using the development version of OMPI from SVN (rev. 19857) for
executing MPI jobs on my cluster system. In particular, I'm using the
checkpoint and restart feature, based on the most recent version of BLCR.

Checkpointing works fine as long as I only execute a single job on a
node. If more than one MPI application is running on a system,
ompi-checkpoint sometimes does not return, hanging forever.


Example: checkpointing with a single running application

I'm using the MPI-enabled flavor of Povray as a demo application, so I'm
starting it on a node using the following command:

  mpirun -np 4 mpi-x-povray +I planet.pov -w1200 -h1000 +SP1 \
  +O planet.tga

This gives me 4 MPI processes, all running on the local node.
Checkpointing it with

  ompi-checkpoint -v --term 7022

(where 7022 is the PID of the mpirun process) gives me a checkpoint
dataset ompi_global_snapshot_7022.ckpt, which can be used for restarting
the job.

The ompi-checkpoint command gives the following output:

---
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:07480] PID 7022
[grid-demo-1.cit.tu-berlin.de:07480] Connected to Mpirun [[2899,0],0]
[grid-demo-1.cit.tu-berlin.de:07480] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: 
Contact Head Node Process PID 7022
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: notify_hnp: 
Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Requested - Global 
Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] Pending (Termination) - Global 
Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:07480]   Running - Global 
Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:07480] File Transfer - Global 
Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:07480] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:07480]  Finished - Global 
Snapshot Reference: ompi_global_snapshot_7022.ckpt

Snapshot Ref.:   0 ompi_global_snapshot_7022.ckpt
---



Example: checkpointing with two running applications

Similar to the first example, I'm again using the MPI-enabled flavor
of Povray as a demo application. But now I'm not starting just a single
Povray computation, but a second one in parallel. This gives me 8 MPI
processes (4 processes for each MPI job), so that the 8 cores of my
system are fully utilized.

Without checkpointing, these two jobs execute without any problem,
each resulting in a Povray image. However, if I use the ompi-checkpoint
command to checkpoint one of the two jobs, ompi-checkpoint is in danger
of never returning.

Again I'm executing

  ompi-checkpoint -v --term 13572

(where 13572 is the PID of the mpirun process). This command gives
the following output and does not return to the user:

---
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: Checkpointing...
[grid-demo-1.cit.tu-berlin.de:14252] PID 13572
[grid-demo-1.cit.tu-berlin.de:14252] Connected to Mpirun [[9529,0],0]
[grid-demo-1.cit.tu-berlin.de:14252] Terminating after checkpoint
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: 
Contact Head Node Process PID 13572
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: notify_hnp: 
Requested a checkpoint of jobid [INVALID]
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:14252] Requested - Global 
Snapshot Reference: (null)
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: 
Receive a command message.
[grid-demo-1.cit.tu-berlin.de:14252] orte_checkpoint: hnp_receiver: 
Status Update.
[grid-demo-1.cit.tu-berlin.de:14252] Pending (Termination) - Global 
Snapshot Reference: (null)

Re: [OMPI users] Checkpointing a restarted app fails

2008-09-24 Thread Matthias Hovestadt

Hi Josh!


> I believe this is now fixed in the trunk. I was able to reproduce
> with the current trunk and committed a fix a few minutes ago in
> r19601. So the fix should be in tonight's tarball (or you can grab it
> from SVN). I've made a request to have the patch applied to v1.3, but
> that may take a day or so to complete.


I updated to r19607 and this indeed did the trick. I'm now able to
checkpoint restarted applications without any problems. Yippee!


> Thanks for the bug report :)


Thanks for fixing it :-)


Best,
Matthias


Re: [OMPI users] Checkpointing a restarted app fails

2008-09-18 Thread Matthias Hovestadt

Hi Josh!

First of all, thanks a lot for replying. :-)


>> When executing this checkpoint command, the running application
>> directly aborts, even though I did not specify the "--term" option:
>>
>> --
>> mpirun noticed that process rank 1 with PID 14050 on node
>> grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
>> --
>>
>> ccs@grid-demo-1:~$


> Interesting. This looks like a bug with the restart mechanism in Open
> MPI. This was working fine, but something must have changed in the trunk
> to break it.


Do you perhaps know an SVN revision number of OMPI that is known to be
working? If this issue is a regression, I would be glad to use the source
from an older but working SVN state...

> A useful piece of debugging information for me would be a stack trace
> from the failed process. You should be able to get this from a core file
> it left, or if you set the following MCA variable in
> $HOME/.openmpi/mca-params.conf:
>
>   opal_cr_debug_sigpipe=1
>
> This will cause the Open MPI app to wait in a sleep loop when it detects
> a Broken Pipe signal. Then you should be able to attach a debugger and
> retrieve a stack trace.
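
For context, the general technique behind such a sigpipe debug hook is a
signal handler that prints its PID and then spins in a sleep loop instead of
letting the process die, so that a debugger can be attached. A rough sketch
of that technique in plain C (my own illustration, not Open MPI's actual
implementation) would be:

  #include <signal.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Spin here until a debugger attaches and clears the flag. */
  static volatile sig_atomic_t wait_for_debugger = 1;

  static void sigpipe_debug_handler(int sig)
  {
      /* fprintf() is not async-signal-safe, but acceptable for a
       * debug-only hook like this one. */
      fprintf(stderr, "sigpipe_debug: signal %d, PID %d\n",
              sig, (int)getpid());
      while (wait_for_debugger) {   /* in gdb: set var wait_for_debugger = 0 */
          sleep(1);
      }
  }

  static void install_sigpipe_debug_hook(void)
  {
      struct sigaction sa;
      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = sigpipe_debug_handler;
      sigemptyset(&sa.sa_mask);
      sigaction(SIGPIPE, &sa, NULL);
  }

With something like this in place, one would attach with "gdb -p <PID>",
run "set var wait_for_debugger = 0", and take a backtrace with "bt".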


I created this file:

ccs@grid-demo-1:~$ cat .openmpi/mca-params.conf
opal_cr_debug_sigpipe=1
ccs@grid-demo-1:~$

Then I restarted the application from a checkpointed state and tried to
checkpoint this restarted application. Unfortunately the restarted
application still terminates, despite this parameter. However, the output
changed slightly:


worker fetch area available 1
[grid-demo-1.cit.tu-berlin.de:26220] opal_cr: sigpipe_debug: Debug 
SIGPIPE [13]: PID (26220)

--
mpirun noticed that process rank 0 with PID 26248 on node 
grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).

--
2 total processes killed (some possibly by mpirun during cleanup)
ccs@grid-demo-1:~$


There is now this additional "opal_cr: sigpipe_debug" line, so it
apparently does evaluate .openmpi/mca-params.conf.


I also tried to get a core file by setting "ulimit -c 5". ulimit -a
gives me the following output:

ccs@grid-demo-1:~$ ulimit -a
core file size  (blocks, -c) 0
data seg size   (kbytes, -d) unlimited
scheduling priority (-e) 20
file size   (blocks, -f) unlimited
pending signals (-i) unlimited
max locked memory   (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files  (-n) 1024
pipe size(512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
real-time priority  (-r) 0
stack size  (kbytes, -s) 8192
cpu time   (seconds, -t) unlimited
max user processes  (-u) unlimited
virtual memory  (kbytes, -v) unlimited
file locks  (-x) unlimited
ccs@grid-demo-1:~$

Unfortunately, no core file is generated, so I do not know how to
provide the requested stack trace.

Are there perhaps other debug parameters I could use?


Best,
Matthias


Re: [OMPI users] Where is ompi-chekpoint?

2008-09-17 Thread Matthias Hovestadt

Hi!


> Hi, I have installed openmpi-1.2.7 with the following instructions:
>
>   ./configure --with-ft=cr --enable-ft-enable-thread --enable-mpi-thread \
>     --with-blcr=$HOME/blcr --prefix=$HOME/openmpi
>   make all install
>
> In the bin directory of $HOME/openmpi there is no ompi-checkpoint and
> no ompi-restart.


As far as I know, checkpointing support is not available
in OMPI 1.2.7. You have to use the devel version (1.3) of
OMPI, e.g. by checking out the source from SVN.


Best,
Matthias



[OMPI users] Checkpointing a restarted app fails

2008-09-17 Thread Matthias Hovestadt

Hi!

Since I am interested in fault tolerance, the checkpoint and restart
support of OMPI is an interesting feature for me. So I installed
BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI
I followed the instructions in the "Fault Tolerance Guide"
in the OMPI wiki:

./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s

This gave me an OMPI version with checkpointing support, so I
started testing. The good news is: I am able to checkpoint and
restart applications. The bad news is: checkpointing a restarted
application fails.

In detail:

1) Starting the application

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml

This starts my MPI-enabled application without any problems.


2) Checkpointing the application

First I queried the PID of the mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  13897  0.4  0.2  63992  2704 pts/0S+   04:59   0:00 mpirun 
-np 2 -am ft-enable-cr yafaray-xml yafaray.xml


Then I checkpointed the job, terminating it directly:

ccs@grid-demo-1:~$ ompi-checkpoint --term 13897
Snapshot Ref.:   0 ompi_global_snapshot_13897.ckpt
ccs@grid-demo-1:~$

The application indeed terminated:
--
mpirun noticed that process rank 0 with PID 13898 on node 
grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).

--
2 total processes killed (some possibly by mpirun during cleanup)

The checkpoint command generated a checkpoint dataset of 367MB:

ccs@grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
367Mompi_global_snapshot_13897.ckpt/
ccs@grid-demo-1:~$



3) Restarting the application

To restart the application, I first executed ompi-clean, then
restarted the job, preloading all files:

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ ompi-restart --preload ompi_global_snapshot_13897.ckpt/

Restarting works fine. The job restarts from the checkpointed state
and continues to execute. If not interrupted, it runs to completion,
returning a correct result.

However, I observed one weird thing: restarting the application seems
to have changed the checkpoint dataset. Moreover, two new directories
were created at restart time:

  4 drwx--  3 ccs  ccs4096 Sep 17 05:09 
ompi_global_snapshot_13897.ckpt

  4 drwx--  2 ccs  ccs4096 Sep 17 05:09 opal_snapshot_0.ckpt
  4 drwx--  2 ccs  ccs4096 Sep 17 05:09 opal_snapshot_1.ckpt



4) Checkpointing again

Again I first looked for the PID of the running mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  14005  0.0  0.2  63992  2736 pts/1S+   05:09   0:00 mpirun 
-am ft-enable-cr --app 
/home/ccs/ompi_global_snapshot_13897.ckpt/restart-appfile



Then I checkpointed it:

ccs@grid-demo-1:~$ ompi-checkpoint 14005


When executing this checkpoint command, the running application
directly aborts, even though I did not specify the "--term" option:

--
mpirun noticed that process rank 1 with PID 14050 on node 
grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).

--
ccs@grid-demo-1:~$


The "ompi-checkpoint 14005" command however does not return.



Is anybody here using the checkpoint/restart capabilities of OMPI?
Has anybody encountered similar problems? Or is there something
wrong with my way of using ompi-checkpoint/ompi-restart?


Any hint is greatly appreciated! :-)



Best,
Matthias


Re: [OMPI users] Checkpoint problem

2008-08-20 Thread Matthias Hovestadt

Hi Gabriele!


> In this case, mpirun works well, but the checkpoint procedure fails:
>
>   ompi-checkpoint 20109
>   [node0316:20134] Error: Unable to get the current working directory
>   [node0316:20134] [[42404,0],0] ORTE_ERROR_LOG: Not found in file
>   orte-checkpoint.c at line 395
>   [node0316:20134] HNP with PID 20109 Not found!


I had exactly the same problem on my machine. Neither modifying the
configure parameters nor changing the way of invoking the ompi-checkpoint
command helped. Since I am using the source from a Subversion checkout,
I also updated the source several times, following the day-to-day
progress. However, the problem remained.

Luckily, updating the source to SVN revision 19265 finally solved
this checkpointing issue. Maybe the problem shows up again in later
versions...


Best,
Matthias