Bryan,

The resource manager plugin is installed by default. As far as I can see, you are launching the application correctly. Just in case, I am attaching initial and restart batch scripts to this e-mail for reference. What is inside: at the moment (for debugging) I usually start dmtcp_coordinator on the frontend and use DMTCP options to point to it. We already have a solution for running the coordinator in batch mode too, but until you get correct behavior it is not worth using. We test DMTCP mostly with Open MPI. A different MPI implementation could also be the reason, but we need to check whether that is so.

1. I need to additionally check the Torque plugin myself. This will take a few days. We add
2. What application do you run, and is it possible for me to get it for testing, with instructions on how to run it exactly as you do?
3. I have access to Torque 2.x installations, and we didn't test Torque 4.x. Is it possible for me to have access to your system for testing and debugging?

2013/10/29 Bryan F Putnam <[email protected]>

> Hi Artem, thanks for writing back.
>
> We're using DMTCP-2.0 and Torque-4.1.5.1.
>
> I'm a bit confused as to how to install a dmtcp plugin, or if in fact the
> Torque plugin is already installed by default. For example if I start up a
> nodes=2:ppn=2 PBS session, my $PBS_NODEFILE may look something like
>
> host1
> host1
> host2
> host2
>
> I then do
>
> dmtcp_launch --rm mpiexec -np 4 ./a.out (4-processor job successfully
> runs on 2 processors on each of 2 nodes)
> dmtcp_command --checkpoint (in a separate window)
> dmtcp_command --kill (in a separate window)
> dmtcp_restart ckpt*.dmtcp
>
> After the last step, the job successfully restarts, but all 4 processes
> are now running on the localhost (host1), nothing is running on host2, and
> the $PBS_NODEFILE appears to be ignored.
>
> Thanks for any tips!
>
> Bryan
>
> ------------------------------
>
> Hello, Bryan.
>
> What version of DMTCP/Torque do you use?
>
>
> 2013/10/29 gene <[email protected]>
>
>> > Perhaps this is something that is handled by the Torque plugin?
>> Yes, that's correct. You'll need to use the DMTCP plugin for Torque.
>> Artem Polyakov is supporting that, and I'm cc'ing to him. Among other
>> issues, mount points can change and network addresses can change on
>> restart.
>> The plugin tries to handle that.
>>
>> Please let us know if you have any trouble using the Torque plugin.
>>
>> Best,
>> - Gene
>>
>> On Mon, Oct 28, 2013 at 03:10:51PM -0400, Bryan F Putnam wrote:
>> >
>> > Dear DMTCP developers,
>> >
>> > I've found that when restarting a multi-node job, dmtcp_restart only
>> > appears to be aware of the local host. Is it possible to tell dmtcp_restart
>> > which hosts are currently available for a job restart, whether it's the
>> > same set of multiple hosts, or a completely different set of hosts?
>> >
>> > Typically our hosts are contained in $PBS_NODEFILE since we use Torque.
>> > Perhaps this is something that is handled by the Torque plugin?
>> >
>> > Thanks,
>> > Bryan
>> >
>> > --
>> > Bryan Putnam
>> > Senior Scientific Applications Analyst
>> > Rosen Center for Advanced Computing, Purdue University
>> > Young Hall (Rm. 910)
>> > 155 S. Grant St.
>> > West Lafayette, IN 47907-2114
>> > Ph 765-496-8225 Fax 765-496-2275
>> > [email protected]
>> > www.purdue.edu/itap
>> >
>
> --
> Best regards, Artem Y. Polyakov
>

--
Best regards, Artem Y. Polyakov

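When comparing where processes land after a restart with what Torque allocated, it can help to summarize the host layout in $PBS_NODEFILE first. The following is only a minimal sketch and not part of DMTCP or Torque: the function name slots_per_host is hypothetical, and it assumes the node file lists one host name per slot, as in the host1/host2 example quoted above.

```shell
#!/bin/sh
# Summarize a Torque-style node file: one line per slot, host names may repeat.
# slots_per_host is a hypothetical helper name, not a DMTCP or Torque command.
slots_per_host() {
    # Print "<host> <slots>" for each distinct host, sorted by host name.
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Demo using the layout from the thread (nodes=2:ppn=2).
nodefile=$(mktemp)
printf 'host1\nhost1\nhost2\nhost2\n' > "$nodefile"
slots_per_host "$nodefile"
rm -f "$nodefile"
```

Inside a real job the helper would be called as slots_per_host "$PBS_NODEFILE". For the layout above it reports two slots on each of host1 and host2, which is the placement the restarted job would be expected to match.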
hellompi_ckpt.job
Description: Binary data
hellompi_rstr.job
Description: Binary data
_______________________________________________ Dmtcp-forum mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
