Hello,

I'm trying to come up with a fault-tolerant OpenMPI setup for research
purposes. I'm doing some tests now, but I'm stuck with a segfault when
I try to restart my test program from a checkpoint.

My test program is the "ring" program, where messages are sent to the
next node in the ring N times. It's pretty simple, I can supply the
source code if needed. I'm running it like this:

# mpirun -np 4 -am ft-enable-cr ring
...
>>> Process 1 sending 703 to 2
>>> Process 3 received 704
>>> Process 3 sending 704 to 0
>>> Process 3 received 703
>>> Process 3 sending 703 to 0
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 18358 on node debian1
exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)

That's the output when I ompi-checkpoint the mpirun PID from another terminal.
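
For reference, the message pattern of the ring test is roughly the
following. This is only a single-process Python sketch of the logic, not
the MPI source I'm actually running; the name simulate_ring and its
parameters are made up for illustration, and it assumes that rank 0
decrements the counter once per lap, which is what the output above
suggests.

```python
def simulate_ring(num_procs, laps):
    """Simulate a token passed around a ring of `num_procs` ranks.

    Hypothetical helper: a sequential stand-in for the MPI ring test,
    returning the log lines instead of sending real messages.
    """
    log = []
    value = laps  # counter carried by the token
    rank = 0      # rank currently holding the token
    while value > 0:
        nxt = (rank + 1) % num_procs  # next neighbour in the ring
        log.append(f">>> Process {rank} sending {value} to {nxt}")
        log.append(f">>> Process {nxt} received {value}")
        rank = nxt
        if rank == 0:
            # Assumption: rank 0 decrements once per completed lap.
            value -= 1
    return log


if __name__ == "__main__":
    for line in simulate_ring(4, 2):
        print(line)
```

With 4 processes this produces lines in the same format as the output
above, just in strict order, since nothing runs in parallel here.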

The checkpoint is taken just fine in maybe 1.5 seconds. I can see the
checkpoint directory has been created in $HOME.

This is what I get when I try to run ompi-restart:

root@debian1:~# ps ax | grep mpirun
18357 pts/0    R+     0:01 mpirun -np 4 -am ft-enable-cr ring
18378 pts/5    S+     0:00 grep mpirun
root@debian1:~# ompi-checkpoint 18357
Snapshot Ref.:   0 ompi_global_snapshot_18357.ckpt
root@debian1:~# ompi-checkpoint --term 18357
Snapshot Ref.:   1 ompi_global_snapshot_18357.ckpt
root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the
       checkpoint file (opal_snapshot_2.ckpt). Returned -1.

--------------------------------------------------------------------------
[debian1:18384] *** Process received signal ***
[debian1:18384] Signal: Segmentation fault (11)
[debian1:18384] Signal code: Address not mapped (1)
[debian1:18384] Failing at address: 0x725f725f
[debian1:18384] [ 0] [0xb775f40c]
[debian1:18384] [ 1] /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
[debian1:18384] [ 2] /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
[debian1:18384] [ 3] /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
[debian1:18384] [ 4] opal-restart [0x804908e]
[debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7568b55]
[debian1:18384] [ 6] opal-restart [0x8048fc1]
[debian1:18384] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 18384 on node debian1
exited on signal 11 (Segmentat
--------------------------------------------------------------------------

I used a clean install of Debian Squeeze (testing) to make sure my
environment was OK. These are the steps I took:

- Installed Debian Squeeze, only base packages
- Installed build-essential, libcr0, libcr-dev, blcr-dkms (build
tools, BLCR dev and run-time environment)
- Compiled openmpi-1.4.1

I compiled openmpi-1.4.1 from source because the Debian package
(openmpi-checkpoint) doesn't seem to be usable at the moment. There
are no leftovers from Debian OpenMPI packages: this is a fresh
install, and no openmpi package was ever installed on it.

I used the following configure options:

# ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads

I also tried adding the option --with-memory-manager=none because I
saw an e-mail on the mailing list that described this as a possible
solution to an (apparently) unrelated problem, but the result is
the same.

I don't have config.log (I rm'ed the build dir), but if you think it's
necessary I can recompile OpenMPI and provide it.

Some information about the system (VirtualBox virtual machine, single
processor, btw):

Kernel version 2.6.32-trunk-686

root@debian1:~# lsmod | grep blcr
blcr                   79084  0
blcr_imports            2077  1 blcr

libcr (BLCR) is version 0.8.2-9.

gcc is version 4.4.3.


Please let me know of any other information you might need.


Thanks in advance,
