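This type of failure is usually caused by prelinking being left enabled
on one or more of the systems. Prelinking can give shared libraries
different load addresses on each host, so a process restarted on another
host may reference addresses that are not mapped there. This has come up
multiple times on the Open MPI list, but it is actually a problem between
BLCR and the Linux kernel. BLCR has a FAQ entry on this that you will
want to check out: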
https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
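As a rough first check, something like the following could be run on each host named in the machinefile. It is only a sketch and assumes a Red Hat-style configuration file at /etc/sysconfig/prelink containing a PRELINKING=yes or PRELINKING=no line; the function name prelink_enabled is made up for illustration, and other distributions may keep this setting elsewhere.

```shell
#!/bin/sh
# Sketch: report whether prelinking is switched on in a prelink config file.
# Assumption: a Red Hat-style file such as /etc/sysconfig/prelink that holds
# a line of the form PRELINKING=yes or PRELINKING=no.

prelink_enabled() {
    conf="${1:-/etc/sysconfig/prelink}"
    # No readable config file: treat prelinking as not enabled here.
    [ -r "$conf" ] || return 1
    # Succeed only if an uncommented PRELINKING=yes line is present.
    grep -Eq '^[[:space:]]*PRELINKING="?yes"?[[:space:]]*$' "$conf"
}

if prelink_enabled; then
    echo "prelink appears to be enabled on $(uname -n)"
else
    echo "prelink does not appear to be enabled on $(uname -n)"
fi
```

If a host reports that prelinking is enabled, the usual remedy is to set PRELINKING=no in that file and undo the existing prelinking as root (for example with prelink --undo --all), then take a fresh checkpoint. The FAQ entry above remains the authoritative description of the procedure.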
If that does not work, then we can look into other causes.
-- Josh
On Mar 5, 2010, at 3:06 AM, 马少杰 wrote:
Dear Sir:
I want to use Open MPI and BLCR for checkpointing. However, I want to
restart from the checkpoint on other hosts. For example, I run an MPI
program with Open MPI on host1 and host2, and I save the checkpoint file
to an NFS shared path. Then I want to restart the job
(ompi-restart -machinefile ma ompi_global_snapshot_15865.ckpt) on host3
and host4. The four hosts have the same hardware and software. If I
change the hostnames in the machinefile to host3 and host4, the following
error always occurs:
[node182:27278] *** Process received signal ***
[node182:27278] Signal: Segmentation fault (11)
[node182:27278] Signal code: Address not mapped (1)
[node182:27278] Failing at address: 0x3b81009530
[node182:27275] *** Process received signal ***
[node182:27275] Signal: Segmentation fault (11)
[node182:27275] Signal code: Address not mapped (1)
[node182:27275] Failing at address: 0x3b81009530
[node182:27274] *** Process received signal ***
[node182:27274] Signal: Segmentation fault (11)
[node182:27274] Signal code: Address not mapped (1)
[node182:27274] Failing at address: 0x3b81009530
[node182:27276] *** Process received signal ***
[node182:27276] Signal: Segmentation fault (11)
[node182:27276] Signal code: Address not mapped (1)
[node182:27276] Failing at address: 0x3b81009530
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 27973 on node node183
exited on signal 11 (Segmentation fault).
If I change the hostnames back to host1 and host2, the job restarts
successfully.
My Open MPI version is 1.3.4, configured with:
./configure --with-ft=cr --enable-mpi-threads --enable-ft-thread \
    --with-blcr=$dir --with-blcr-libdir=/$dir/lib --prefix=$dir_ompi \
    --enable-mpirun-prefix-by-default
The command used to run the MPI program is:
mpirun -np 8 --am ft-enable-cr --mca opal_cr_use_thread 0 -machinefile ma ./cpi
My $HOME/.openmpi/mca-params.conf contains:
crs_base_snapshot_dir=/tmp/cr
snapc_base_global_snapshot_dir=/disk/cr