On Sun, Feb 28, 2010 at 11:11 PM, Fernando Lemos <fernando...@gmail.com> wrote:
> Hello,
>
>
> I'm trying to come up with a fault tolerant OpenMPI setup for research
> purposes. I'm doing some tests now, but I'm stuck with a segfault when
> I try to restart my test program from a checkpoint.
>
> My test program is the "ring" program, where messages are sent to the
> next node in the ring N times. It's pretty simple, I can supply the
> source code if needed. I'm running it like this:
>
> # mpirun -np 4 -am ft-enable-cr ring
> ...
>>>> Process 1 sending 703 to 2
>>>> Process 3 received 704
>>>> Process 3 sending 704 to 0
>>>> Process 3 received 703
>>>> Process 3 sending 703 to 0
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 18358 on node debian1
> exited on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> 4 total processes killed (some possibly by mpirun during cleanup)
>
> That's the output when I ompi-checkpoint the mpirun PID from another terminal.
>
> The checkpoint is taken just fine in maybe 1.5 seconds. I can see the
> checkpoint directory has been created in $HOME.
>
> This is what I get when I try to run ompi-restart
>
> ps axroot@debian1:~# ps ax | grep mpirun
> 18357 pts/0 R+ 0:01 mpirun -np 4 -am ft-enable-cr ring
> 18378 pts/5 S+ 0:00 grep mpirun
> root@debian1:~# ompi-checkpoint 18357
> Snapshot Ref.: 0 ompi_global_snapshot_18357.ckpt
> root@debian1:~# ompi-checkpoint --term 18357
> Snapshot Ref.: 1 ompi_global_snapshot_18357.ckpt
> root@debian1:~# ompi-restart ompi_global_snapshot_18357.ckpt
> --------------------------------------------------------------------------
> Error: Unable to obtain the proper restart command to restart from the
> checkpoint file (opal_snapshot_2.ckpt). Returned -1.
>
> --------------------------------------------------------------------------
> [debian1:18384] *** Process received signal ***
> [debian1:18384] Signal: Segmentation fault (11)
> [debian1:18384] Signal code: Address not mapped (1)
> [debian1:18384] Failing at address: 0x725f725f
> [debian1:18384] [ 0] [0xb775f40c]
> [debian1:18384] [ 1]
> /usr/local/lib/libopen-pal.so.0(opal_argv_free+0x33) [0xb771ea63]
> [debian1:18384] [ 2]
> /usr/local/lib/libopen-pal.so.0(opal_event_fini+0x30) [0xb77150a0]
> [debian1:18384] [ 3]
> /usr/local/lib/libopen-pal.so.0(opal_finalize+0x35) [0xb7708fa5]
> [debian1:18384] [ 4] opal-restart [0x804908e]
> [debian1:18384] [ 5] /lib/i686/cmov/libc.so.6(__libc_start_main+0xe5)
> [0xb7568b55]
> [debian1:18384] [ 6] opal-restart [0x8048fc1]
> [debian1:18384] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 2 with PID 18384 on node debian1
> exited on signal 11 (Segmentat
> --------------------------------------------------------------------------
>
> I used a clean install of Debian Squeeze (testing) to make sure my
> environment was ok. Those are the steps I took:
>
> - Installed Debian Squeeze, only base packages
> - Installed build-essential, libcr0, libcr-dev, blcr-dkms (build
> tools, BLCR dev and run-time environment)
> - Compiled openmpi-1.4.1
>
> Note that I did compile openmpi-1.4.1 because the Debian package
> (openmpi-checkpoint) doesn't seem to be usable at the moment. There
> are no leftovers from any previous install of Debian packages
> supplying OpenMPI because this is a fresh install, no openmpi package
> had been installed before.
>
> I used the following configure options:
>
> # ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads
>
> I also tried to add the option --with-memory-manager=none because I
> saw an e-mail on the mailing list that described this as a possible
> solution to an (apparently) not related problem, but the problem
> remains the same.
>
> I don't have config.log (I rm'ed the build dir), but if you think it's
> necessary I can recompile OpenMPI and provide it.
>
> Some information about the system (VirtualBox virtual machine, single
> processor, btw):
>
> Kernel version 2.6.32-trunk-686
>
> root@debian1:~# lsmod | grep blcr
> blcr 79084 0
> blcr_imports 2077 1 blcr
>
> libcr (BLCR) is version 0.8.2-9.
>
> gcc is version 4.4.3.
>
>
> Please let me know of any other information you might need.
>
>
> Thanks in advance,
>
Hello,

I figured it out. The problem was that the Debian package blcr-util,
which contains the BLCR binaries (cr_restart, cr_checkpoint, etc.),
wasn't installed.

I believe OpenMPI could show a more descriptive message instead of
segfaulting, though? Also, you might want to add that information to
the FAQ.

Anyway, I'm filing another Debian bug report.

For the sake of completeness, here's some more information:

- I forgot to mention that, since I installed OpenMPI to /usr/local,
I'm setting LD_LIBRARY_PATH to /usr/lib:/usr/local/lib in .bashrc, and
thus I can run any OpenMPI command without problems.

- I tested BLCR with cr_checkpoint and cr_restart with a simple app,
and it worked great too.

- I've purged /usr/local and rebuilt OpenMPI with the mentioned flags
to obtain the attached config.log (gzipped).

- With blcr-util installed, I can ompi-restart just fine. Without it
installed, I get the segfault mentioned in my previous message.

Best regards,
config.log.gz
Description: GNU Zip compressed data
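
For anyone who hits the same segfault, a quick pre-flight check along these lines may help confirm whether the BLCR user-space binaries are reachable before ompi-restart is run. This is only a minimal sketch: the helper name missing_tools is made up for illustration, and it assumes that having cr_checkpoint and cr_restart on PATH is what matters, as the message above suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or BLCR): print the name of
# each given command that cannot be found on PATH, one per line.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# BLCR user-space binaries named in the message above.
missing=$(missing_tools cr_checkpoint cr_restart)
if [ -n "$missing" ]; then
    echo "BLCR tools not found on PATH:" >&2
    echo "$missing" >&2
else
    echo "BLCR tools found on PATH."
fi
```

If the check reports missing tools, installing the distribution's BLCR utilities package and re-running it before ompi-restart would be the obvious next step.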