Hello,

I've integrated DMTCP (1.2.4 on Fedora 16 and Ubuntu 11.10) with Grid Engine 
(https://arc.liv.ac.uk/trac/SGE and http://gridscheduler.sourceforge.net/)
and I've come across a curious problem.

I've made my laptop a Grid Engine master, execution host and submission host,
and have created two queues. When I submit a job to Grid Engine asking for
DMTCP checkpointing, everything works fine - checkpointing, re-starting,
re-checkpointing, re-re-starting, ... - as long as the job runs in the same
queue. However, when I checkpoint and kill a job and force it to re-start on
another queue, checkpointing that re-started job doesn't work. When the system
initiates a checkpoint, the job core-dumps and issues the error message

     [3308] mtcp.c:2333 perform_callback_write_ckpt_header:
       error 2 creating temp file: No such file or directory
     [3313] mtcp.c:2333 perform_callback_write_ckpt_header:
       error 2 creating temp file: No such file or directory

and the checkpoint command used (dmtcp_command -bc) exits with status 99.

N.B.: This job is re-starting on the same host, just in a different Grid Engine
queue from the original job.
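
Since error 2 is ENOENT, one thing that could be checked from inside the
re-started job is whether the directories DMTCP might write to still exist.
The sketch below is only a diagnostic aid under assumptions: it assumes the
missing path is one named by DMTCP_CHECKPOINT_DIR, DMTCP_TMPDIR or the TMPDIR
that Grid Engine sets for each job, which has not been confirmed as the cause.
The helper name check_dir is made up for illustration.

```shell
#!/bin/sh
# Report whether the directory named by an environment variable exists and
# is writable. check_dir is a hypothetical helper, not part of DMTCP or
# Grid Engine.
check_dir() {
    name=$1
    # Indirectly read the variable whose name was passed in.
    eval "dir=\${$name:-}"
    if [ -z "$dir" ]; then
        echo "$name: unset"
    elif [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "$name: ok ($dir)"
    else
        echo "$name: missing or not writable ($dir)"
    fi
}

# Assumption: one of these is where the temp file is being created.
for v in DMTCP_CHECKPOINT_DIR DMTCP_TMPDIR TMPDIR; do
    check_dir "$v"
done
```

Running this in the original job and again in the job re-started on the other
queue, then comparing the two outputs, would show whether any of these paths
changes or disappears between queues.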

I'll be grateful for any suggestions.

David Gabriel Simas

_______________________________________________
Dmtcp-forum mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum