Now I have set $HOME as the shared directory, but when I run ompi-checkpoint it prints the following (nimbus1 is a remote machine in my cluster):
[nimbus1:12630] opal_os_dirpath_create: Error: Unable to create the sub-directory (/home/mpiu/ompi_global_snapshot_1662.ckpt/0) of (/home/mpiu/ompi_global_snapshot_1662.ckpt/0/opal_snapshot_4.ckpt), mkdir failed [1]
[nimbus1:12630] Error: No metadata filename specified!

Why is that?

Cheers,
fengguang

On Tue, Mar 23, 2010 at 10:37 AM, Fernando Lemos <fernando...@gmail.com> wrote:
> On Tue, Mar 23, 2010 at 12:24 PM, fengguang tian <ferny...@gmail.com> wrote:
> > Hi
> >
> > I am using open-mpi and blcr in a cluster of 3 machines, and checkpoint
> > and restart work fine on a single machine, but when checkpointing in
> > the cluster environment, ompi-checkpoint hangs
>
> Besides what has been said in another thread (regarding 1.4 and
> checkpointing to shared directories), you might want to make sure your
> app is terminated if you send a SIGTERM to it. Some apps might ignore
> SIGTERM or handle it in a way that does not cause the apps to quit.
>
> ompi-checkpoint --term is simply ompi-checkpoint + sending SIGTERM to
> the application (not sure whether SIGTERM is sent to each process
> individually or not).
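
One way to narrow down the mkdir failure reported above might be to check, on each node, whether the user running the job can create a nested directory under the shared location at all. The following is only a minimal sketch under stated assumptions: the function name probe_snapshot_dir and the probe directory names are hypothetical, and the nested layout merely imitates the paths in the error message; it is not what Open MPI itself creates. Whether the bracketed 1 in "mkdir failed [1]" is an errno value is also an assumption, so the sketch simply reports whatever errno Python sees.

```python
import errno
import os
import shutil


def probe_snapshot_dir(base_dir):
    """Try to create a nested directory under base_dir, then remove it.

    Returns (True, None) on success, or (False, errno_name) on failure.
    """
    # Hypothetical probe names that only imitate the nested layout
    # <global snapshot>/<n>/<local snapshot> seen in the error message.
    probe_root = os.path.join(base_dir, "ckpt_probe_%d.ckpt" % os.getpid())
    nested = os.path.join(probe_root, "0", "opal_probe.ckpt")
    try:
        os.makedirs(nested)
        return True, None
    except OSError as exc:
        # Translate the numeric errno into its symbolic name when known.
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        # Remove only the probe tree, never base_dir itself.
        shutil.rmtree(probe_root, ignore_errors=True)
```

Calling probe_snapshot_dir(os.path.expanduser("~")) as the job's user on every machine (nimbus1 included) would show whether the failure is specific to one node, for example because the shared mount is missing or has different ownership there.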