Re: [OMPI users] Checkpointing a restarted app fails

2008-09-24 Thread Matthias Hovestadt

Hi Josh!


I believe this is now fixed in the trunk. I was able to reproduce
with the current trunk and committed a fix a few minutes ago in
r19601. So the fix should be in tonight's tarball (or you can grab it
from SVN). I've made a request to have the patch applied to v1.3, but
that may take a day or so to complete.


I updated to r19607 and this indeed solved the problem. I'm now
able to checkpoint restarted applications without any
problems. Yippee!
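
For reference, the update and rebuild boil down to something like the
following. This is only a sketch, assuming an existing SVN working copy of
the trunk and the configure flags from my original post; the install step
uses whatever prefix you configured before:

svn update -r 19607
./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s && make install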


Thanks for the bug report :)


Thanks for fixing it :-)


Best,
Matthias


Re: [OMPI users] Checkpointing a restarted app fails

2008-09-22 Thread Josh Hursey
I believe this is now fixed in the trunk. I was able to reproduce with  
the current trunk and committed a fix a few minutes ago in r19601. So  
the fix should be in tonight's tarball (or you can grab it from SVN).  
I've made a request to have the patch applied to v1.3, but that may  
take a day or so to complete.


Let me know if this fix eliminates your SIGPIPE issues.

Thanks for the bug report :)

Cheers,
Josh

On Sep 17, 2008, at 11:55 PM, Matthias Hovestadt wrote:


Hi Josh!

First of all, thanks a lot for replying. :-)


When executing this checkpoint command, the running application
aborts immediately, even though I did not specify the "--term" option:

--
mpirun noticed that process rank 1 with PID 14050 on node
grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).

--
ccs@grid-demo-1:~$
Interesting. This looks like a bug with the restart mechanism in  
Open MPI. This was working fine, but something must have changed in  
the trunk to break it.


Do you perhaps know a SVN revision number of OMPI that
is known to be working? If this issue is a regression
failure, I would be glad to use the source from an old
but working SVN state...

A useful piece of debugging information for me would be a stack
trace from the failed process. You should be able to get this from
a core file it left, or by setting the following MCA variable
in $HOME/.openmpi/mca-params.conf:

 opal_cr_debug_sigpipe=1
This will cause the Open MPI app to wait in a sleep loop when it  
detects a Broken Pipe signal. Then you should be able to attach a  
debugger and retrieve a stack trace.


I created this file:

ccs@grid-demo-1:~$ cat .openmpi/mca-params.conf
opal_cr_debug_sigpipe=1
ccs@grid-demo-1:~$

Then I restarted the application from a checkpointed state
and tried to checkpoint this restarted application. Unfortunately
the restarted application still terminates, despite this parameter.
However, the output changed slightly:


worker fetch area available 1
[grid-demo-1.cit.tu-berlin.de:26220] opal_cr: sigpipe_debug: Debug  
SIGPIPE [13]: PID (26220)

--
mpirun noticed that process rank 0 with PID 26248 on node
grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).

--
2 total processes killed (some possibly by mpirun during cleanup)
ccs@grid-demo-1:~$


There is now this additional "opal_cr: sigpipe_debug" line, so
it apparently does evaluate .openmpi/mca-params.conf.


I also tried to get a corefile by setting "ulimit -c 5", so
that ulimit -a gives me the following output:

ccs@grid-demo-1:~$ ulimit -a
core file size  (blocks, -c) 0
data seg size   (kbytes, -d) unlimited
scheduling priority (-e) 20
file size   (blocks, -f) unlimited
pending signals (-i) unlimited
max locked memory   (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files  (-n) 1024
pipe size(512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
real-time priority  (-r) 0
stack size  (kbytes, -s) 8192
cpu time   (seconds, -t) unlimited
max user processes  (-u) unlimited
virtual memory  (kbytes, -v) unlimited
file locks  (-x) unlimited
ccs@grid-demo-1:~$

Unfortunately, no corefile is generated, so I do not know
how to provide the requested stack trace.

Are there perhaps other debug parameters I could use?


Best,
Matthias




Re: [OMPI users] Checkpointing a restarted app fails

2008-09-18 Thread Matthias Hovestadt

Hi Josh!

First of all, thanks a lot for replying. :-)


When executing this checkpoint command, the running application
aborts immediately, even though I did not specify the "--term" option:

-- 

mpirun noticed that process rank 1 with PID 14050 on node 
grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
-- 


ccs@grid-demo-1:~$


Interesting. This looks like a bug with the restart mechanism in Open 
MPI. This was working fine, but something must have changed in the trunk 
to break it.


Do you perhaps know a SVN revision number of OMPI that
is known to be working? If this issue is a regression
failure, I would be glad to use the source from an old
but working SVN state...

A useful piece of debugging information for me would be a stack trace
from the failed process. You should be able to get this from a core file
it left, or by setting the following MCA variable in
$HOME/.openmpi/mca-params.conf:

  opal_cr_debug_sigpipe=1
This will cause the Open MPI app to wait in a sleep loop when it detects 
a Broken Pipe signal. Then you should be able to attach a debugger and 
retrieve a stack trace.


I created this file:

ccs@grid-demo-1:~$ cat .openmpi/mca-params.conf
opal_cr_debug_sigpipe=1
ccs@grid-demo-1:~$

Then I restarted the application from a checkpointed state
and tried to checkpoint this restarted application. Unfortunately
the restarted application still terminates, despite this parameter.
However, the output changed slightly:


worker fetch area available 1
[grid-demo-1.cit.tu-berlin.de:26220] opal_cr: sigpipe_debug: Debug 
SIGPIPE [13]: PID (26220)

--
mpirun noticed that process rank 0 with PID 26248 on node 
grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).

--
2 total processes killed (some possibly by mpirun during cleanup)
ccs@grid-demo-1:~$


There is now this additional "opal_cr: sigpipe_debug" line, so
it apparently does evaluate .openmpi/mca-params.conf.


I also tried to get a corefile by setting "ulimit -c 5", so
that ulimit -a gives me the following output:

ccs@grid-demo-1:~$ ulimit -a
core file size  (blocks, -c) 0
data seg size   (kbytes, -d) unlimited
scheduling priority (-e) 20
file size   (blocks, -f) unlimited
pending signals (-i) unlimited
max locked memory   (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files  (-n) 1024
pipe size(512 bytes, -p) 8
POSIX message queues (bytes, -q) unlimited
real-time priority  (-r) 0
stack size  (kbytes, -s) 8192
cpu time   (seconds, -t) unlimited
max user processes  (-u) unlimited
virtual memory  (kbytes, -v) unlimited
file locks  (-x) unlimited
ccs@grid-demo-1:~$

Unfortunately, no corefile is generated, so I do not know
how to provide the requested stack trace.
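
Note that the ulimit -a output above still reports "core file size
(blocks, -c) 0", which by itself would already explain the missing
corefile. A minimal sketch of what I could try next, in the very shell
that launches mpirun (the core_pattern check is only a sanity check and
may differ on other systems):

ulimit -c unlimited                  # lift the core size limit for this shell
ulimit -c                            # should now print "unlimited"
cat /proc/sys/kernel/core_pattern    # where (and under what name) cores land
mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml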

Are there perhaps other debug parameters I could use?


Best,
Matthias


Re: [OMPI users] Checkpointing a restarted app fails

2008-09-17 Thread Josh Hursey


On Sep 16, 2008, at 11:18 PM, Matthias Hovestadt wrote:


Hi!

Since I am interested in fault tolerance, checkpointing and
restart of OMPI is an interesting feature for me. So I installed
BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI
I followed the instructions in the "Fault Tolerance Guide"
in the OMPI wiki:

./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s

This gave me an OMPI version with checkpointing support, so I
started testing. The good news is: I am able to checkpoint and
restart applications. The bad news is: checkpointing a restarted
application fails.

In detail:

1) Starting the application

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml  
yafaray.xml


This starts my MPI-enabled application without any problems.


2) Checkpointing the application

First I queried the PID of the mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  13897  0.4  0.2  63992  2704 pts/0S+   04:59   0:00  
mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml


Then I checkpointed the job, terminating it directly:

ccs@grid-demo-1:~$ ompi-checkpoint --term 13897
Snapshot Ref.:   0 ompi_global_snapshot_13897.ckpt
ccs@grid-demo-1:~$

The application indeed terminated:
-- 

mpirun noticed that process rank 0 with PID 13898 on node
grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).
-- 


2 total processes killed (some possibly by mpirun during cleanup)

The checkpoint command generated a checkpoint dataset
of 367MB size:

ccs@grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
367Mompi_global_snapshot_13897.ckpt/
ccs@grid-demo-1:~$



3) Restarting the application

For restarting the application, I first executed ompi-clean,
then restarted the job, preloading all files:

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ ompi-restart --preload  
ompi_global_snapshot_13897.ckpt/


Restarting works fine. The job restarts from the
checkpointed state and continues to execute. If not interrupted,
it runs to completion and returns a correct result.

However, I observed one weird thing: restarting the application
seems to have changed the checkpoint dataset. Moreover, two new
directories have been created at restart time:

  4 drwx--  3 ccs  ccs  4096 Sep 17 05:09 ompi_global_snapshot_13897.ckpt

  4 drwx--  2 ccs  ccs  4096 Sep 17 05:09 opal_snapshot_0.ckpt
  4 drwx--  2 ccs  ccs  4096 Sep 17 05:09 opal_snapshot_1.ckpt




The 'opal_snapshot_*.ckpt' directories are an artifact of the --preload
option. This option copies the individual checkpoints to the remote
machine before executing.




4) Checkpointing again

Again I first looked for the PID of the running mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  14005  0.0  0.2  63992  2736 pts/1S+   05:09   0:00  
mpirun -am ft-enable-cr --app
/home/ccs/ompi_global_snapshot_13897.ckpt/restart-appfile



Then I checkpointed it:

ccs@grid-demo-1:~$ ompi-checkpoint 14005


When executing this checkpoint command, the running application
aborts immediately, even though I did not specify the "--term" option:

-- 

mpirun noticed that process rank 1 with PID 14050 on node
grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).
-- 


ccs@grid-demo-1:~$


Interesting. This looks like a bug with the restart mechanism in Open  
MPI. This was working fine, but something must have changed in the  
trunk to break it.


A useful piece of debugging information for me would be a stack trace
from the failed process. You should be able to get this from a core
file it left, or by setting the following MCA variable in
$HOME/.openmpi/mca-params.conf:

  opal_cr_debug_sigpipe=1
This will cause the Open MPI app to wait in a sleep loop when it  
detects a Broken Pipe signal. Then you should be able to attach a  
debugger and retrieve a stack trace.
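
For example (just a sketch; the PID is that of the rank that caught the
SIGPIPE), once the process is sitting in the sleep loop you could do:

gdb -p <PID-of-waiting-rank>
(gdb) thread apply all bt
(gdb) detach
(gdb) quit

The full "thread apply all bt" output is what I am after.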





The "ompi-checkpoint 14005" command however does not return.



Is anybody here using the checkpoint/restart capabilities of OMPI?
Did anybody encounter similar problems? Or is there something
wrong with my way of using ompi-checkpoint/ompi-restart?


I work with the checkpoint/restart functionality on a daily basis,  
but I must admit that I haven't worked on the trunk in a few weeks.   
I'll take a look and let you know what I find. I suspect that Open  
MPI is not resetting properly after a checkpoint.





Any hint is greatly appreciated! :-)



Best,
Matthias




[OMPI users] Checkpointing a restarted app fails

2008-09-17 Thread Matthias Hovestadt

Hi!

Since I am interested in fault tolerance, checkpointing and
restart of OMPI is an interesting feature for me. So I installed
BLCR 0.7.3 as well as OMPI from SVN (rev. 19553). For OMPI
I followed the instructions in the "Fault Tolerance Guide"
in the OMPI wiki:

./autogen.sh
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
make -s
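
To double-check that checkpoint/restart support actually made it into the
build, I would expect ompi_info to list the corresponding components
(exact labels may vary between trunk revisions):

ompi_info | grep -i checkpoint    # should report checkpoint support
ompi_info | grep -i " crs"        # should list the BLCR crs component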

This gave me an OMPI version with checkpointing support, so I
started testing. The good news is: I am able to checkpoint and
restart applications. The bad news is: checkpointing a restarted
application fails.

In detail:

1) Starting the application

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ mpirun -np 2 -am ft-enable-cr yafaray-xml yafaray.xml

This starts my MPI-enabled application without any problems.


2) Checkpointing the application

First I queried the PID of the mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  13897  0.4  0.2  63992  2704 pts/0S+   04:59   0:00 mpirun 
-np 2 -am ft-enable-cr yafaray-xml yafaray.xml


Then I checkpointed the job, terminating it directly:

ccs@grid-demo-1:~$ ompi-checkpoint --term 13897
Snapshot Ref.:   0 ompi_global_snapshot_13897.ckpt
ccs@grid-demo-1:~$
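
As an aside, the PID lookup and the checkpoint can be combined into a
single command, assuming only one matching mpirun is running for this
user (pgrep is just a convenience here, not required):

ompi-checkpoint --term $(pgrep -u ccs -f "mpirun -np 2")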

The application indeed terminated:
--
mpirun noticed that process rank 0 with PID 13898 on node 
grid-demo-1.cit.tu-berlin.de exited on signal 0 (Unknown signal 0).

--
2 total processes killed (some possibly by mpirun during cleanup)

The checkpoint command generated a checkpoint dataset
of 367MB size:

ccs@grid-demo-1:~$ du -s -h ompi_global_snapshot_13897.ckpt/
367Mompi_global_snapshot_13897.ckpt/
ccs@grid-demo-1:~$



3) Restarting the application

For restarting the application, I first executed ompi-clean,
then restarted the job, preloading all files:

ccs@grid-demo-1:~$ ompi-clean
ccs@grid-demo-1:~$ ompi-restart --preload ompi_global_snapshot_13897.ckpt/

Restarting works fine. The job restarts from the
checkpointed state and continues to execute. If not interrupted,
it runs to completion and returns a correct result.

However, I observed one weird thing: restarting the application
seems to have changed the checkpoint dataset. Moreover, two new
directories have been created at restart time:

  4 drwx--  3 ccs  ccs  4096 Sep 17 05:09 ompi_global_snapshot_13897.ckpt

  4 drwx--  2 ccs  ccs  4096 Sep 17 05:09 opal_snapshot_0.ckpt
  4 drwx--  2 ccs  ccs  4096 Sep 17 05:09 opal_snapshot_1.ckpt



4) Checkpointing again

Again I first looked for the PID of the running mpirun process:

ccs@grid-demo-1:~$ ps auxww | grep mpirun
ccs  14005  0.0  0.2  63992  2736 pts/1S+   05:09   0:00 mpirun 
-am ft-enable-cr --app 
/home/ccs/ompi_global_snapshot_13897.ckpt/restart-appfile



Then I checkpointed it:

ccs@grid-demo-1:~$ ompi-checkpoint 14005


When executing this checkpoint command, the running application
aborts immediately, even though I did not specify the "--term" option:

--
mpirun noticed that process rank 1 with PID 14050 on node 
grid-demo-1.cit.tu-berlin.de exited on signal 13 (Broken pipe).

--
ccs@grid-demo-1:~$


The "ompi-checkpoint 14005" command however does not return.



Is anybody here using the checkpoint/restart capabilities of OMPI?
Did anybody encounter similar problems? Or is there something
wrong with my way of using ompi-checkpoint/ompi-restart?


Any hint is greatly appreciated! :-)



Best,
Matthias