Hello,
I'm trying to come up with a fault-tolerant OpenMPI setup for research purposes. I'm doing some tests now, but I'm stuck with a segfault when I try to restart my test program from a checkpoint.

My test program is the "ring" program, where messages are sent to the next node in the ring N times. It's pretty simple; I can supply the source code if needed. I'm running it like this:

# mpirun -np 4 -am ft-enable-cr ring
...
>>> Process 1 sending 703 to 2
>>> Process 3 received 704
>>> Process 3 sending 704 to 0
>>> Process 3 received 703
>>> Process 3 sending 703 to 0
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 18358 on node debian1 exited on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
4 total processes killed (some possibly by mpirun during cleanup)

That's the output when I ompi-checkpoint the mpirun PID from another terminal.

The checkpoint is taken just fine in maybe 1.5 seconds. I can see the checkpoint directory has been created in $HOME.

This is what I get when I try to run ompi-restart:

root@debian1:~# ps ax | grep mpirun
18357 pts/0 R+ 0:01 mpirun -np 4 -am ft-enable-cr ring
18378 pts/5 S+ 0:00 grep mpirun
root@debian1:~# ompi-checkpoint 18357
Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt
root@debian1:~# ompi-checkpoint --term 18357
Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt
root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
--------------------------------------------------------------------------
Error: Unable to obtain the proper restart command to restart from the checkpoint file (opal_snapshot_2.ckpt). Returned -1.
--------------------------------------------------------------------------
[debian1:18384] *** Process received signal ***
[debian1:18384] Signal: Segmentation fault (11)
[debian1:18384] Signal code: Address not mapped (1)
[debian1:18384] Failing at address: 0x725f725f
[debian1:18384] [ 0] [0xb775f40c]
[debian1:18384] [ 1] /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
[debian1:18384] [ 2] /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
[debian1:18384] [ 3] /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
[debian1:18384] [ 4] opal-restart [0x804908e]
[debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5) [0xb7568b55]
[debian1:18384] [ 6] opal-restart [0x8048fc1]
[debian1:18384] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 18384 on node debian1 exited on signal 11 (Segmentat
--------------------------------------------------------------------------

I used a clean install of Debian Squeeze (testing) to make sure my environment was OK. These are the steps I took:

- Installed Debian Squeeze, only base packages
- Installed build-essential, libcr0, libcr-dev, blcr-dkms (build tools, BLCR dev and run-time environment)
- Compiled openmpi-1.4.1

Note that I compiled openmpi-1.4.1 myself because the Debian package (openmpi-checkpoint) doesn't seem to be usable at the moment. There are no leftovers from any previous install of Debian packages supplying OpenMPI because this is a fresh install; no openmpi package had been installed before.

I used the following configure options:

# ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads

I also tried adding the option --with-memory-manager=none because I saw an e-mail on the mailing list that described this as a possible solution to an (apparently) unrelated problem, but the problem remains the same.
I don't have config.log (I rm'ed the build dir), but if you think it's necessary I can recompile OpenMPI and provide it.

Some information about the system (VirtualBox virtual machine, single processor, btw):

Kernel version 2.6.32-trunk-686

root@debian1:~# lsmod | grep blcr
blcr 79084 0
blcr_imports 2077 1 blcr

libcr (BLCR) is version 0.8.2-9. gcc is version 4.4.3.

Please let me know of any other information you might need.

Thanks in advance,
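
P.S. For anyone trying to reproduce this, the checkpoint and restart sequence from the transcript above can be wrapped in a few shell functions. This is only a rough sketch: the function names (snapshot_name, checkpoint_and_stop, restart_from) are made up for illustration, and it assumes the snapshot name is the last field on the "Snapshot Ref.:" line, as in the output shown above. The only Open MPI commands used are ompi-checkpoint (with --term) and ompi-restart, exactly as in the report.

```shell
#!/bin/sh
# Sketch only: hypothetical helper functions around the commands from the report.

# Read ompi-checkpoint output on stdin and print the snapshot name, e.g. from
#   "Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt"
# Assumption: the name is the last whitespace-separated field on that line.
snapshot_name() {
    awk '/^Snapshot Ref\.:/ { print $NF }'
}

# Checkpoint the mpirun process with the given PID, terminate the job,
# and print the name of the resulting global snapshot.
checkpoint_and_stop() {
    ompi-checkpoint --term "$1" | snapshot_name
}

# Restart the job from a snapshot name printed by checkpoint_and_stop.
restart_from() {
    ompi-restart "$1"
}
```

With the functions loaded, something like snap=$(checkpoint_and_stop 18357) && restart_from "$snap" would repeat the sequence that fails here, where 18357 stands for whatever PID mpirun has on your system.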