Hi Josh,
In case it helps, I am running 1.3.3, compiled as follows:
../configure --enable-ft-thread --with-ft=cr --enable-mpi-threads
--with-blcr=... --with-blcr-libdir=... --disable-openib-rdmacm --prefix=....
I ran my application like this:
mpirun -am ft-enable-cr --hostfile host -np 2 ./a.out
where host contains:
node1
node2
With this hostfile, checkpointing and then restarting works:
ompi-restart -hostfile host ompi_global_snapshot_....ckpt
but if I then change the contents of host to (just swapping the nodes):
node2
node1
then the restart crashes.
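A minimal sketch of the reproduction, for anyone who wants to try it. The script only writes the two hostfiles and prints the commands instead of running them. The file names host.orig and host.swapped are hypothetical (the run above used a single file called host, edited in place), and the snapshot name is left elided exactly as above.

```shell
#!/bin/sh
# Hostfile for the original run (node names taken from the message above).
printf 'node1\nnode2\n' > host.orig

# Same two nodes in the opposite order: restarting with this one crashes.
# (host.orig / host.swapped are made-up names for illustration.)
printf 'node2\nnode1\n' > host.swapped

# The commands are echoed, not executed, so the script is safe to run anywhere.
echo "mpirun -am ft-enable-cr --hostfile host.orig -np 2 ./a.out"
echo "ompi-restart -hostfile host.orig ompi_global_snapshot_....ckpt     # works"
echo "ompi-restart -hostfile host.swapped ompi_global_snapshot_....ckpt  # crashes"
```

Keeping the two hostfiles side by side makes it easy to switch between the working and the failing restart without editing a file in place.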
thanks
Josh Hursey wrote:
Though I do not test this scenario (using hostfiles) very often, it
used to work. The ompi-restart command takes a --hostfile (or
--machinefile) argument that is passed directly to the mpirun command.
I wonder if something broke recently with this handoff. I can
certainly checkpoint with one set of nodes/allocation and restart with
another, but most/all of my testing occurs in a SLURM environment, so
no need for an explicit hostfile.
I'll take a look to see if I can reproduce, but probably will not be
until next week.
-- Josh
On Dec 2, 2009, at 9:54 AM, Jonathan Ferland wrote:
Hi,
I am trying to use BLCR checkpointing with MPI. I am currently able to
run my application using some hostfile, checkpoint the run, and then
restart the application using the same hostfile. What I would like to
do is restart the application with a different hostfile, but this
leads to a segfault with 1.3.3.
Is it possible to restart the application using a different hostfile?
We are using PBS to create the hostfile, so each new restart might
be on different nodes. If it is possible, how can we do that? If not,
do you plan to include this in a future release?
thanks
_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users