Re: [gridengine users] Trying to get checkpointing to work

Reuti Tue, 12 Aug 2014 03:30:45 -0700

Hi,

On 12.08.2014 at 11:51, Wouter Verhelst wrote:


> I wrote:
>> However, I've since noticed that there was actually an error in my
>> script, which caused it to fail. This would probably explain why it
>> wasn't doing any checkpoints...
> 
> I've since debugged my script, and my initial tests show that checkpointing a 
> job with BLCR (due to subordinate queueing) and restarting it on the same 
> host later works perfectly. So far so good.
> 
> I'm now also trying to migrate a job from one host to another, but I'm 
> bumping into an issue that I don't immediately see a solution for:
> 
> When restarting a job, cr_restart will try to restart the job in exactly the 
> same context as it was before. This includes files we're writing to, reading 
> from, etc. Unfortunately, that also includes the job script which we're 
> actually running, which is <execd_spooldir>/<hostname>/<jobid> or some such. 
> When a job is migrated to another host, the result is then that cr_restart 
> tries to open a file in the old host's spooldir, which no longer exists (the 
> files have been moved to the new host's spooldir).
> 
> cr_restart has an option '--relocate' which would allow me to fix this issue, 
> but then I would need to know the hostname of the host where the checkpoint 
> was created. As far as I can see, that isn't any information that SGE stores, 
> but I might be missing something...?

Maybe you can submit the job as a binary, i.e. `qsub -b y ...`. Then it will be 
executed from the place where it is originally located. This means, of course, 
that there is no staging of the script. Hence the script still has to exist at 
the time of execution, while in the "-b n" case you could `qsub` a script and 
delete it, as the copy SGE made is transferred to the node and executed later 
on.
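
As a rough sketch of what such a submission could look like: the checkpointing 
environment name `blcr` and the path `/shared/jobs/myjob.sh` are only 
placeholders (assumptions, not taken from your setup), and the helper 
`build_qsub_cmd` is hypothetical. It just prints the command line, so nothing 
is submitted until you run the printed command yourself.

```shell
#!/bin/sh
# Sketch only: builds a qsub command line for a binary-mode submission.
# Both arguments are placeholders; substitute your own values.
#   $1 = name of the checkpointing environment (assumed to be "blcr" below)
#   $2 = absolute path of the job script, assumed to be on a filesystem
#        that every execution host can reach under the same path
build_qsub_cmd() {
    # -b y  : treat the command as a binary, so SGE does not copy the
    #         script into the execd spool directory
    # -ckpt : select the checkpointing environment for the job
    printf 'qsub -b y -ckpt %s %s\n' "$1" "$2"
}

# Print the command instead of running it.
build_qsub_cmd blcr /shared/jobs/myjob.sh
```

With "-b y" the script has to be executable and reachable under the same path 
on whichever host the job is restarted on. As far as I know, `#$` lines 
embedded in the script are also not parsed in this mode, so any options have 
to go on the command line.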

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
