Hi Artem, thanks for writing back. 

We're using DMTCP-2.0 and Torque-4.1.5.1. 


I'm a bit confused about how to install a DMTCP plugin, or whether the 
Torque plugin is in fact already installed by default. For example, if I start 
a nodes=2:ppn=2 PBS session, my $PBS_NODEFILE may look something like 


host1 
host1 
host2 
host2 


I then do 


dmtcp_launch --rm mpiexec -np 4 ./a.out (the 4-process job runs successfully, 
with 2 processes on each of the 2 nodes) 
dmtcp_command --checkpoint (in a separate window) 
dmtcp_command --kill (in a separate window) 
dmtcp_restart ckpt*.dmtcp 


After the last step, the job successfully restarts, but all 4 processes are now 
running on the localhost (host1), nothing is running on host2, and the 
$PBS_NODEFILE appears to be ignored. 
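
To make the comparison above easier to check, here is a minimal sketch that summarizes a Torque node file as one "host slots" pair per line, so the expected placement (2 processes on each of host1 and host2) can be compared with where the processes actually run after dmtcp_restart. The helper name nodefile_summary is hypothetical and is not part of DMTCP or Torque; it only reads the node file and does not change how dmtcp_restart places processes.

```shell
#!/bin/sh
# nodefile_summary: hypothetical helper, not part of DMTCP or Torque.
# Prints "host count" for each distinct host in a Torque node file,
# in the order the hosts first appear.
nodefile_summary() {
    # $1: path to the node file; falls back to $PBS_NODEFILE if omitted.
    file="${1:-$PBS_NODEFILE}"
    # Skip blank lines, count occurrences of each host name.
    awk 'NF { if (!($1 in n)) order[++k] = $1; n[$1]++ }
         END { for (i = 1; i <= k; i++) print order[i], n[order[i]] }' "$file"
}
```

Inside a PBS session, calling nodefile_summary with no arguments reads $PBS_NODEFILE. Comparing its output before the checkpoint and after the restart shows whether the allocation still lists both hosts even though all processes ended up on host1.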


Thanks for any tips! 


Bryan 

----- Original Message -----



Hello, Bryan. 


Which versions of DMTCP and Torque are you using? 



2013/10/29 gene < [email protected] > 


> Perhaps this is something that is handled by the Torque plugin? 
Yes, that's correct. You'll need to use the DMTCP plugin for Torque. 
Artem Polyakov supports that plugin, and I'm cc'ing him. Among other 
issues, mount points and network addresses can change on restart. 
The plugin tries to handle that. 

Please let us know if you have any trouble using the Torque plugin. 

Best, 
- Gene 

On Mon, Oct 28, 2013 at 03:10:51PM -0400, Bryan F Putnam wrote: 
> 
> Dear DMTCP developers, 
> 
> I've found that when restarting a multi-node job, dmtcp_restart only appears 
> to be aware of the local host. Is it possible to tell dmtcp_restart which 
> hosts are currently available for a job restart, whether it's the same set of 
> multiple hosts, or a completely different set of hosts? 
> 
> Typically our hosts are contained in $PBS_NODEFILE since we use Torque. 
> Perhaps this is something that is handled by the Torque plugin? 
> 
> Thanks, 
> Bryan 
> 
> -- 
> Bryan Putnam 
> Senior Scientific Applications Analyst 
> Rosen Center for Advanced Computing, Purdue University 
> Young Hall (Rm. 910) 
> 155 S. Grant St. 
> West Lafayette, IN 47907-2114 
> Ph 765-496-8225 Fax 765-496-2275 
> [email protected] 
> www.purdue.edu/itap 




-- 
Best regards, Artem Y. Polyakov 
_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
