Re: [OMPI users] ompi-checkpoint problem on shared storage

2011-09-27 Thread Dave Schulz
Thanks Josh. Just yesterday I stumbled upon another interesting detail about this issue. While reconfiguring things, I accidentally ran as root, and the checkpointing all succeeded. I'm not sure, though, how to go about finding which file things are hanging on. I've compared straces as roo

Re: [OMPI users] ompi-checkpoint problem on shared storage

2011-09-23 Thread Josh Hursey
It sounds like there is a race happening in the shutdown of the processes. I wonder if the app is shutting down in a way that mpirun does not quite like. I have not tested the C/R functionality in the 1.4 series in a long time. Can you give it a try with the 1.5 series, and see if there is any var

[OMPI users] ompi-checkpoint problem on shared storage

2011-09-23 Thread Dave Schulz
Hi Everyone. I've been trying to figure out an issue with ompi-checkpoint/BLCR. The symptoms seem to be related to which filesystem the snapc_base_global_snapshot_dir is located on. I wrote a simple MPI program where rank 0 sends to 1, 1 to 2, and so on, then the highest rank sends to 0. Then it waits
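
For readers who want to picture the ring pattern described above (rank 0 sends to 1, 1 to 2, and the highest rank sends back to 0), here is a minimal sketch in plain Python. It is not the original test program and does not use MPI: threads and queues stand in for ranks and messages, and the names `ring_neighbors` and `run_ring` are hypothetical, chosen only for illustration. Whatever the original program does after the ring completes is not modeled here.

```python
import queue
import threading


def ring_neighbors(rank, size):
    """Return (source, destination) ranks for `rank` in a ring of `size` ranks."""
    return (rank - 1) % size, (rank + 1) % size


def run_ring(size, token=0):
    """Pass a token once around the ring.

    Each hop increments the token. Returns a dict mapping every rank to the
    value it received. Threads emulate ranks; this is an assumption made for
    the sketch, not how an MPI job is actually launched.
    """
    inboxes = [queue.Queue() for _ in range(size)]
    received = {}
    lock = threading.Lock()

    def worker(rank):
        _, dest = ring_neighbors(rank, size)
        if rank == 0:
            # Rank 0 starts the ring, then waits for the token to come back.
            inboxes[dest].put(token + 1)
            value = inboxes[rank].get(timeout=5)
        else:
            # Every other rank receives first, then forwards to its neighbor.
            value = inboxes[rank].get(timeout=5)
            inboxes[dest].put(value + 1)
        with lock:
            received[rank] = value

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    # With 4 emulated ranks, rank 0 gets the token back after 4 hops.
    print(run_ring(4))
```

Having rank 0 send before it receives, while every other rank receives before it sends, is what keeps a ring like this from deadlocking when the sends are blocking.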