It sounds like there is a race condition in the shutdown of the processes. I wonder if the app is shutting down in a way that mpirun does not quite like.
I have not tested the C/R functionality in the 1.4 series in a long time. Can you give it a try with the 1.5 series and see if there is any variation? You might also try the trunk, but I have not tested it recently enough to know whether things are still working correctly (have others?).

I'll file a ticket so we do not lose track of the bug. Hopefully we will get to it soon:
https://svn.open-mpi.org/trac/ompi/ticket/2872

Thanks,
Josh

On Fri, Sep 23, 2011 at 3:08 PM, Dave Schulz <dsch...@ucalgary.ca> wrote:
> Hi Everyone.
>
> I've been trying to figure out an issue with ompi-checkpoint/BLCR. The
> symptoms seem to be related to what filesystem the
> snapc_base_global_snapshot_dir is located on.
>
> I wrote a simple MPI program where rank 0 sends to 1, 1 sends to 2, and so
> on, then the highest rank sends back to 0; it then waits one second and
> repeats.
>
> I'm using openmpi-1.4.3. When I run "ompi-checkpoint --term <pidofmpirun>"
> with the snapshot directory on the shared filesystem, ompi-checkpoint
> returns a checkpoint reference and the worker processes go away, but the
> mpirun remains and is stuck (it dies right away if I run kill on it, so it
> is responding to SIGTERM). If I attach strace to the mpirun, I get the
> following from strace forever:
>
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6, 1000) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6, 1000) = 0 (Timeout)
> poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=8, events=POLLIN}, {fd=10, events=POLLIN}, {fd=0, events=POLLIN}], 6, 1000) = 0 (Timeout)
>
> I'm running with:
> mpirun -machinefile machines -am ft-enable-cr ./mpiloop
> The "machines" file simply has the local hostname listed a few times; I've
> tried 2 and 8. I can try up to 24 if it's deemed useful, as this node is a
> pretty big one: a 4-socket Opteron machine with 6 cores per socket and
> 256 GB of RAM.
>
> I initially installed this on a CentOS 5.6 test system with only local
> hard disks and standard NFS, where everything worked as expected. When I
> moved over to the production system, things started breaking. The
> filesystem is the major software difference: the shared filesystems are
> Ibrix, and that is where the above symptoms started to appear.
>
> I haven't even moved on to multi-node MPI runs, as I can't get this to
> work for any number of processes on the local machine unless I set the
> checkpoint directory to /tmp, which is on a local XFS hard disk. If I put
> the checkpoints on any shared directory, things fail.
>
> I've tried a number of *_verbose MCA parameters, and none of them seem to
> issue any messages at the point of checkpoint; further messages appear
> only when I give up and kill `pidof mpirun`.
>
> Open MPI is compiled with:
> ./configure --prefix=/global/software/openmpi-blcr
> --with-blcr=/global/software/blcr
> --with-blcr-libdir=/global/software/blcr/lib/ --with-ft=cr
> --enable-ft-thread --enable-mpi-threads --with-openib --with-tm
>
> BLCR is configured only with a prefix putting it in /global/software/blcr;
> otherwise it is vanilla. Both are compiled with the default gcc.
>
> One final note: occasionally it does succeed and terminate, but it seems
> completely random.
>
> What I'm wondering is: has anyone else seen symptoms like this, especially
> where the mpirun doesn't quit after a checkpoint with --term but the
> worker processes do?
>
> Also, is there some somewhat unusual filesystem semantic that
> ompi/ompi-checkpoint needs and that our shared filesystem may not support?
>
> Thanks for any insights you may have.
>
> -Dave

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
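
[Editor's note: Dave's mpiloop source is not included in the thread. For anyone wanting to reproduce the report, a minimal ring-passing test matching his description might look like the sketch below; only the communication pattern (0 -> 1 -> ... -> N-1 -> 0, one-second pause, repeat) comes from his message, and everything else, including the printf and the file name mpiloop.c, is an assumption.]

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* A ring needs at least two processes (Dave tried 2 and 8). */
        if (size < 2) {
            fprintf(stderr, "run with at least 2 processes\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        int token = 0;
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;

        /* Pass a token around the ring forever, sleeping one second
         * between iterations, as described in the original message. */
        while (1) {
            if (rank == 0) {
                token++;
                MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                printf("iteration %d completed the ring\n", token);
                fflush(stdout);
            } else {
                MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
            }
            sleep(1);
        }

        /* Never reached: the run is ended by ompi-checkpoint --term. */
        MPI_Finalize();
        return 0;
    }

Assuming it is saved as mpiloop.c, it could be built with "mpicc mpiloop.c -o mpiloop" and launched with the mpirun command quoted above; since the loop never exits on its own, the run is ended externally by ompi-checkpoint --term.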