Thanks Josh,
Just yesterday I stumbled upon another interesting detail about this
issue. While reconfiguring things, I accidentally ran as root, and the
checkpointing all succeeded. I'm not sure, though, how to go about
finding which file things are hanging on. I've compared straces as
roo
It sounds like there is a race happening in the shutdown of the
processes. I wonder if the app is shutting down in a way that mpirun
does not quite like.
I have not tested the C/R functionality in the 1.4 series in a long
time. Can you give it a try with the 1.5 series, and see if there is
any var
Hi Everyone.
I've been trying to figure out an issue with ompi-checkpoint/BLCR. The
symptoms seem to be related to which filesystem the
snapc_base_global_snapshot_dir is located on.
I wrote a simple MPI program where rank 0 sends to rank 1, 1 to 2, and
so on, then the highest rank sends to 0. Then it waits