Eric,

Thanks for the great work on this integration. I filed a ticket for the problem areas that you highlighted with the Open MPI side of the integration so we do not lose track of them:

  https://svn.open-mpi.org/trac/ompi/ticket/2842

Hopefully we will get some cycles to address these issues in the near term.

Thanks,
Josh

On Wed, Jul 27, 2011 at 3:52 PM, Eric Roman <ero...@lbl.gov> wrote:
>
> Dear Open MPI Developers,
>
> We've been working on using Torque's checkpoint/restart support, along with
> BLCR and Open MPI's C/R support, to perform C/R on parallel jobs running
> under Torque. The main issue here is that Open MPI requires the use of
> ompi-checkpoint and ompi-restart commands to checkpoint the application, but
> Torque uses cr_checkpoint and cr_restart to checkpoint job scripts, so an
> adapter is required between the two interfaces. I've written a small
> program, called cr_mpirun, that meets this purpose.
>
> This code is now available on the BLCR web site that should enable you to
> use BLCR cr_checkpoint and cr_restart commands to checkpoint Open MPI
> applications. You can download it at the following URL:
>
> https://upc-bugs.lbl.gov/blcr-dist/cr_mpirun/cr_mpirun-210.tar.gz
>
> This code can be used fairly reliably to invoke cr_checkpoint and cr_restart
> on Open MPI applications. In turn, this enables you to use Torque's
> checkpoint/restart support on parallel jobs. I've tested mainly with qhold
> and qrls, but have also experimented with using Maui's preemptee and
> preemptor classes.
>
> This release is intended as a development release, meaning that this release
> is not suitable for general production use, but should be used for testing.
> There are a number of issues that need to be worked out, and we need
> feedback from Torque and Open MPI developers, and from users interested in
> testing or filing bug reports.
>
> There is a list of known issues documented in the BUGS file in the release.
> There are HOWTO files in the release that describe the implementation,
> workarounds for current problems, and usage of cr_mpirun.
>
> Thanks for your interest.
>
> Sincerely,
> Eric Roman
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey