Hi Kit,

I am so sorry for the late response. Somehow I missed this email earlier. I am CC'ing Artem Polyakov, who wrote the TORQUE plugin for DMTCP.

Thanks,
Kapil

On Wed, Jan 30, 2013 at 2:52 PM, Kit Menlove <[email protected]> wrote:
> Hi all,
>
> I’m using a cluster that uses Torque as the batch system. About half of the
> time, checkpointing fails while copying the temporary output buffer/file
> with the following error:
>
> [27763] ERROR at connection.cpp:1214 in CopyFile;
> REASON='JASSERT(_real_system(command.c_str()) != -1) failed'
>
> The generic system command is “cp -f
> /var/spool/torque/spool/jobid.myserver.OU
> /checkpoint_dir/ckpt_myprog_52b886013bb1c112-27763-51060104_files/jobid.myserver.OU_99001”
>
> I’m using dmtcp_checkpoint (v1.2.6) with the --checkpoint-open-files option.
> Is anyone familiar with Torque enough to suggest why the file might not
> exist at the time of checkpointing, or what else might be the cause of the
> CopyFile failure?
>
> Thanks,
> Kit
>
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
