Hi all,

This is the procedure I followed to install Open MPI. Is there an installation or environment-setting problem in it?

An Open MPI program with 4 processes is run across 2 dual-core Intel machines, with 2 processes running on each machine.
ompi-checkpoint succeeds, but ompi-restart fails with the following error:

$:> ompi-restart ompi_global_snapshot_6045.ckpt
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 6372 on node acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Open MPI installation steps:

./configure --prefix=/home/csgrad/audhakne/.openmpi --with-ft=cr --with-blcr=/usr/lib64 --enable-debug
make
make install
export LD_LIBRARY_PATH=$HOME/.openmpi/lib/:$HOME/.openmpi/lib/openmpi:/usr/lib64
export PATH=$HOME/.openmpi/bin:$PATH

NOTE: BLCR is installed as a kernel module:

$:> lsmod | grep blcr
blcr                  117892  0
blcr_vmadump           58264  1 blcr
blcr_imports           46080  2 blcr,blcr_vmadump

Please let me know if there is a problem with the above procedure. Thanks a lot for your time.

Best.

---------- Forwarded message ----------
From: arun dhakne <arundha...@gmail.com>
Date: Tue, Sep 30, 2008 at 12:52 AM
Subject: ompi-restart issue : ompi-restart doesn't work across nodes
To: Open MPI Users <us...@open-mpi.org>

Hi all,

I have gone through some previous ompi-restart issues, but I couldn't find anything similar to this problem.

I have installed BLCR and configured Open MPI 'openmpi-1.3a1r19645'.

i) If the sample MPI program is run on a single machine (say, -np 4 without any hostfile) and I checkpoint it, the checkpoint succeeds and ompi-restart also works.

ii) If the sample MPI program is run across, say, 2 different nodes, the checkpoint succeeds, BUT ompi-restart throws the following error:

$ ompi-restart ompi_global_snapshot_7604.ckpt
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 9590 on node acl-cadi-pentd-1.cse.buffalo.edu exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Please let me know if more information is needed.

--
Thanks and Regards,
Arun U. Dhakne
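One way to narrow down the environment question is to confirm that every node, not just the one where the commands are typed, sees the same PATH, LD_LIBRARY_PATH and BLCR modules as in the installation steps above. The export lines only affect the shell they are typed in; processes that mpirun starts on the other node through a non-interactive remote shell may not read the same startup files. The script below is a minimal sketch, assuming bash and the $HOME/.openmpi prefix from the steps above. The file name check_ompi_env.sh and the functions path_has, blcr_missing_modules and check_node are hypothetical helpers, not part of Open MPI or BLCR.

```shell
#!/bin/bash
# Hypothetical helper (not part of Open MPI or BLCR): report whether this node
# sees the environment described in the installation steps of the post.

# Succeed if directory $1 is an entry of the colon-separated list $2.
# A trailing slash on either side is ignored, so ".../lib/" matches ".../lib".
path_has() {
    local want="${1%/}" entry
    local IFS=':'
    for entry in $2; do
        if [ "${entry%/}" = "$want" ]; then
            return 0
        fi
    done
    return 1
}

# Read lsmod-style output on stdin and print each of the three BLCR modules
# named in the post that is not listed. Succeeds only if all three are present.
blcr_missing_modules() {
    local mods status=0 m
    mods=$(awk '{ print $1 }')   # first column = module name (header is harmless)
    for m in blcr blcr_vmadump blcr_imports; do
        if ! printf '%s\n' "$mods" | grep -qxF "$m"; then
            echo "$m"
            status=1
        fi
    done
    return $status
}

# Print a short report for the current node; return non-zero if anything differs.
check_node() {
    local prefix="$HOME/.openmpi" status=0 d missing
    echo "node: $(uname -n)"
    if path_has "$prefix/bin" "$PATH"; then
        echo "  PATH: ok"
    else
        echo "  PATH: missing $prefix/bin"
        status=1
    fi
    for d in "$prefix/lib" "$prefix/lib/openmpi"; do
        if path_has "$d" "${LD_LIBRARY_PATH:-}"; then
            echo "  LD_LIBRARY_PATH: has $d"
        else
            echo "  LD_LIBRARY_PATH: missing $d"
            status=1
        fi
    done
    # Shows which ompi-restart binary this shell would actually run.
    echo "  ompi-restart: $(command -v ompi-restart || echo 'not found')"
    if missing=$(lsmod 2>/dev/null | blcr_missing_modules); then
        echo "  BLCR modules: ok"
    else
        echo "  BLCR modules missing:" $missing
        status=1
    fi
    return $status
}

check_node || echo "  => this node differs from the installation steps"
```

To see what remotely started processes see, run it through a non-interactive login on each node and compare the reports, for example:

    ssh acl-cadi-pentd-1.cse.buffalo.edu 'bash -s' < check_ompi_env.sh

A clean report on both nodes would not by itself explain the segmentation fault, but a difference between them would be worth fixing before looking further.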