Hi David,

Thanks for contacting us. It looks like DMTCP is unable to create temp
files when starting in the new queue/environment. Let me try and
explain the situation a bit and then we can discuss the possible
solutions.

DMTCP determines its temp directory from the environment variable
DMTCP_TMPDIR, then TMPDIR, and falls back to /tmp if neither is set.
The env vars DMTCP_TMPDIR and TMPDIR are read only once (during the
initial startup) and remain the same throughout the rest of the
computation, so a restarted job still uses the original values. My
guess is that the locations pointed to by DMTCP_TMPDIR/TMPDIR don't
exist in the new queue. Can you confirm whether this is correct?
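
To check this from inside a job in each queue, something like the
following could help. It is only a sketch that follows the lookup order
described above; it is not DMTCP's own code, and the helper names
resolve_tmpdir and can_create_temp_file are made up for illustration.

```python
import os
import tempfile


def resolve_tmpdir(env):
    """Return the first of DMTCP_TMPDIR, TMPDIR, /tmp (order described above)."""
    for name in ("DMTCP_TMPDIR", "TMPDIR"):
        value = env.get(name)
        if value:
            return value
    return "/tmp"


def can_create_temp_file(directory):
    """Try to create (and immediately delete) a temp file in `directory`."""
    try:
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        # Covers a missing directory (errno 2) as well as permission errors.
        return False


if __name__ == "__main__":
    tmpdir = resolve_tmpdir(os.environ)
    status = "usable" if can_create_temp_file(tmpdir) else "NOT usable"
    print(tmpdir, status)
```

Running it in the original queue and noting the directory, then testing
that same directory from a job in the second queue, should show whether
the path is missing there.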

If it turns out to be correct, there are two possible solutions:
1. We point DMTCP_TMPDIR to a location that is always accessible.
2. We modify DMTCP such that on restart, it looks at some particular
file etc. to find out the _current_ tmpdir.

The first solution seems much simpler, but I am not sure whether it can
be done in your environment. The second solution is more complicated:
we would write a simple DMTCP module that updates the env vars
DMTCP_TMPDIR/TMPDIR etc. on restart.
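
To make the second option a bit more concrete, here is a rough sketch
of the logic such a module might apply on restart. Everything in it is
an assumption for illustration: the function name refresh_tmpdir, the
idea of a one-line file holding the current tmpdir, and the choice to
leave the environment alone when that file is unusable. The real module
would have to run inside DMTCP at restart time; this only shows the
decision logic.

```python
import os


def refresh_tmpdir(env, tmpdir_file):
    """Point DMTCP_TMPDIR and TMPDIR at the directory named in `tmpdir_file`.

    Returns the new directory, or None if the file is missing or names a
    directory that does not exist (env is left unchanged in that case).
    """
    try:
        with open(tmpdir_file) as handle:
            current = handle.read().strip()
    except OSError:
        return None
    if not current or not os.path.isdir(current):
        return None
    # Hypothetical policy: overwrite both variables with the current location.
    env["DMTCP_TMPDIR"] = current
    env["TMPDIR"] = current
    return current
```

Leaving the old values in place when the file cannot be used keeps the
behaviour identical to today's in the single-queue case.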

Please let me know which solution works better for you.

Thanks,
Kapil


On Mon, Feb 27, 2012 at 8:01 PM, David Gabriel Simas
<[email protected]> wrote:
>
> Hello,
>
> I've integrated DMTCP (1.2.4 on Fedora 16 and Ubuntu 11.10) with Grid Engine
> (https://arc.liv.ac.uk/trac/SGE and http://gridscheduler.sourceforge.net/)
> and I've come across a curious problem.
>
> I've made my laptop a Grid Engine master, execution host and submission host,
> and have created two queues. When I submit a job to Grid Engine asking for
> DMTCP checking, everything works fine - checkpointing, re-starting,
> re-checkpointing, re-re-starting, ... - as long as the job runs in the same
> queue. However, when I checkpoint and kill a job and force it to re-start on
> another queue, checkpointing that re-started job doesn't work. When the system
> initiates a checkpoint, the job core-dumps and issues the error message
>
>     [3308] mtcp.c:2333 perform_callback_write_ckpt_header:
>       error 2 creating temp file: No such file or directory
>     [3313] mtcp.c:2333 perform_callback_write_ckpt_header:
>       error 2 creating temp file: No such file or directory
>
> and the checkpoint command used (dmtcp_command -bc) has 99 as its exit status.
>
> N.B.: This job is re-starting on the same host, just in a different Grid Engine
> queue from the original job.
>
> I'll be grateful for any suggestions.
>
> David Gabriel Simas
>
> ------------------------------------------------------------------------------
> Keep Your Developer Skills Current with LearnDevNow!
> The most comprehensive online learning library for Microsoft developers
> is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
> Metro Style Apps, more. Free future releases when you subscribe now!
> http://p.sf.net/sfu/learndevnow-d2d
> _______________________________________________
> Dmtcp-forum mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
