On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian <[email protected]> wrote:
> Hi
>
> I am using open-mpi and blcr in a cluster of 3 machines, and the checkpoint
> and restart work fine in single machine,but when doing checkpoint in
> clusters environment, the ompi-checkpoint hangs

Besides what has been said in another thread (regarding 1.4 and checkpointing to shared directories), you might want to make sure your app actually terminates when you send it a SIGTERM. Some apps ignore SIGTERM or handle it in a way that does not cause them to quit. ompi-checkpoint --term is simply ompi-checkpoint followed by sending SIGTERM to the application (I am not sure whether SIGTERM is sent to each process individually or not).
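
If it helps, here is a rough way to check the SIGTERM behaviour of an app on a single machine, outside of Open MPI. It is only a sketch: the function name terminates_on_sigterm and the settle/timeout values are my own choices, and it tests a plain process, not what ompi-checkpoint does internally.

```python
import signal
import subprocess
import sys
import time


def terminates_on_sigterm(argv, settle=1.0, timeout=5.0):
    """Start argv, send it SIGTERM, and report whether it exits in time.

    settle  -- seconds to wait so the app can install its signal handlers
    timeout -- seconds to wait for the app to exit after SIGTERM
    """
    proc = subprocess.Popen(argv)
    try:
        time.sleep(settle)
        if proc.poll() is not None:
            raise RuntimeError("process exited before SIGTERM was sent")
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=timeout)
            return True
        except subprocess.TimeoutExpired:
            return False
    finally:
        # Clean up a process that ignored SIGTERM so it does not linger.
        if proc.poll() is None:
            proc.kill()
            proc.wait()


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_sigterm.py ./my_app arg1 arg2
    ok = terminates_on_sigterm(sys.argv[1:])
    print("exits on SIGTERM" if ok else "still running after SIGTERM")
```

If this reports that the app is still running after SIGTERM, the app's signal handling is a likely suspect and worth looking at before digging further into the checkpoint setup.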
