Hi David,

Thanks for contacting us. It looks like DMTCP is unable to create temp files when starting in the new queue/environment. Let me explain the situation a bit, and then we can discuss the possible solutions.

DMTCP checks the environment variables DMTCP_TMPDIR and TMPDIR, and then falls back to /tmp (in this order), to determine the temp directory. The env vars DMTCP_TMPDIR and TMPDIR are initialized only once (during the initial startup) and remain the same throughout the rest of the computation. My guess is that the location pointed to by DMTCP_TMPDIR/TMPDIR doesn't exist in the new queue. Can you confirm whether this is correct?

If it turns out to be correct, there are two possible solutions:

1. We point DMTCP_TMPDIR to a location that is always accessible.
2. We modify DMTCP such that on restart, it looks at some particular file etc. to find out the _current_ tmpdir.

The first solution seems much simpler, but I am not sure if it can be done in your environment. The second solution is more complicated, but we can write a simple DMTCP module which would update the env vars DMTCP_TMPDIR/TMPDIR etc. on restart.

Please let me know which solution works better for you.

Thanks,
Kapil

On Mon, Feb 27, 2012 at 8:01 PM, David Gabriel Simas <[email protected]> wrote:
>
> Hello,
>
> I've integrated DMTCP (1.2.4 on Fedora 16 and Ubuntu 11.10) with Grid Engine
> (https://arc.liv.ac.uk/trac/SGE and http://gridscheduler.sourceforge.net/)
> and I've come across a curious problem.
>
> I've made my laptop a Grid Engine master, execution host and submission host,
> and have created two queues. When I submit a job to Grid Engine asking for
> DMTCP checking, everything works fine - checkpointing, re-starting,
> re-checkpointing, re-re-starting, ... - as long as the job runs in the same
> queue. However, when I checkpoint and kill a job and force it to re-start on
> another queue, checkpointing that re-started job doesn't work.
> When the system initiates a checkpoint, the job core-dumps and issues the
> error message
>
> [3308] mtcp.c:2333 perform_callback_write_ckpt_header:
>   error 2 creating temp file: No such file or directory
> [3313] mtcp.c:2333 perform_callback_write_ckpt_header:
>   error 2 creating temp file: No such file or directory
>
> and the checkpoint command used (dmtcp_command -bc) has 99 as its exit status.
>
> N.B.: This job is re-starting on the same host, just in a different Grid
> Engine queue from the original job.
>
> I'll be grateful for any suggestions.
>
> David Gabriel Simas
>
> ------------------------------------------------------------------------------
> Keep Your Developer Skills Current with LearnDevNow!
> The most comprehensive online learning library for Microsoft developers
> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
> Metro Style Apps, more. Free future releases when you subscribe now!
> http://p.sf.net/sfu/learndevnow-d2d
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
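
P.S. To help confirm the guess above, here is a minimal sketch of the lookup order described in the reply (DMTCP_TMPDIR, then TMPDIR, then /tmp), together with a check that a temp file can actually be created there. It is written in Python purely for illustration and is not DMTCP's actual code; the function names resolve_tmpdir and can_create_temp_file are hypothetical.

```python
import os
import tempfile


def resolve_tmpdir(env=None):
    """Return the first candidate in the order DMTCP_TMPDIR, TMPDIR, /tmp.

    Illustrative only: this mirrors the order described in the email,
    not DMTCP's real implementation.
    """
    env = os.environ if env is None else env
    for name in ("DMTCP_TMPDIR", "TMPDIR"):
        value = env.get(name)
        if value:
            return value
    return "/tmp"


def can_create_temp_file(directory):
    """Try to create (and immediately delete) a temp file in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory (errno 2, as in the reported error)
        # as well as permission problems.
        return False


if __name__ == "__main__":
    tmpdir = resolve_tmpdir()
    print("resolved temp directory:", tmpdir)
    print("can create temp file:", can_create_temp_file(tmpdir))
```

Running this from a job script in each queue and comparing the output should show whether the two queues resolve to different directories. Since the variables are fixed at the initial startup, the value that matters for the restarted job is the one printed in the original queue; if that directory cannot be used from the second queue, the first solution (a DMTCP_TMPDIR that is always accessible) is probably the easier fix.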
