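This type of failure is usually caused by prelinking being left enabled on one or more of the systems. Prelinking can place shared libraries at different addresses on each host, so a process restored on a different node may reference an address that is not mapped there, which is consistent with the "Address not mapped" segfaults below. This has come up several times on the Open MPI list, but it is actually a problem between BLCR and the Linux kernel. BLCR has a FAQ entry on this that you will want to check: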
  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
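
As a quick check before going through the FAQ, something like the following could be run on each node to see whether prelinking is switched on. This is only a rough sketch: it assumes a Red Hat-style system where the setting is stored as PRELINKING=yes|no in /etc/sysconfig/prelink, and the function name prelink_enabled is made up for illustration. Other distributions may keep the setting elsewhere.

```shell
#!/bin/sh
# Sketch: report whether prelinking is enabled on this host.
# Assumption: the setting lives in /etc/sysconfig/prelink (Red Hat-style
# layout) as a line of the form PRELINKING=yes or PRELINKING=no.

# prelink_enabled [CONFIG_FILE]
# Returns 0 if the config file has an uncommented PRELINKING=yes line,
# non-zero otherwise (including when the file is missing or unreadable).
prelink_enabled() {
    conf="${1:-/etc/sysconfig/prelink}"
    [ -r "$conf" ] || return 1
    grep -Eq '^[[:space:]]*PRELINKING="?yes' "$conf"
}

if prelink_enabled; then
    echo "$(uname -n): prelink enabled"
else
    echo "$(uname -n): prelink not enabled (or no config found)"
fi
```

If any node reports that prelink is enabled, the FAQ entry above describes how to turn it off and undo the existing prelinking. All nodes involved in the checkpoint and the restart (host1 through host4 in this case) should end up in the same state.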

If that does not work, then we can look into other causes.

-- Josh

On Mar 5, 2010, at 3:06 AM, 马少杰 wrote:




2010-03-05
马少杰
Dear Sir:
I want to use Open MPI and BLCR for checkpointing. However, I want to restart from the checkpoint
on other hosts. For example, I run an MPI program using Open MPI on
host1 and host2, and I save the checkpoint file to an NFS shared path.
Then I want to restart the job (ompi-restart -machinefile ma ompi_global_snapshot_15865.ckpt) on host3 and host4. The four hosts have the same hardware and software. If I change the hostnames in the machinefile to host3 and host4, the following error always occurs:
[node182:27278] *** Process received signal ***
[node182:27278] Signal: Segmentation fault (11)
[node182:27278] Signal code: Address not mapped (1)
[node182:27278] Failing at address: 0x3b81009530
[node182:27275] *** Process received signal ***
[node182:27275] Signal: Segmentation fault (11)
[node182:27275] Signal code: Address not mapped (1)
[node182:27275] Failing at address: 0x3b81009530
[node182:27274] *** Process received signal ***
[node182:27274] Signal: Segmentation fault (11)
[node182:27274] Signal code: Address not mapped (1)
[node182:27274] Failing at address: 0x3b81009530
[node182:27276] *** Process received signal ***
[node182:27276] Signal: Segmentation fault (11)
[node182:27276] Signal code: Address not mapped (1)
[node182:27276] Failing at address: 0x3b81009530
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 27973 on node node183 exited on signal 11 (Segmentation fault).

If I change the hostnames back to host1 and host2, the restart succeeds.

My Open MPI version is 1.3.4, configured with:
./configure --with-ft=cr --enable-mpi-threads --enable-ft-thread --with-blcr=$dir --with-blcr-libdir=/$dir/lib --prefix=$dir_ompi --enable-mpirun-prefix-by-default

The command used to run the MPI program is:
mpirun -np 8 --am ft-enable-cr --mca opal_cr_use_thread 0 -machinefile ma ./cpi

My $HOME/.openmpi/mca-params.conf contains:
crs_base_snapshot_dir=/tmp/cr
snapc_base_global_snapshot_dir=/disk/cr



