Re: [gridengine users] Trying to get checkpointing to work

Wouter Verhelst Tue, 12 Aug 2014 03:13:39 -0700

Hi,

I wrote:
> However, I've since noticed that there was actually an error in my
> script, which caused it to fail. This would probably explain why it
> wasn't doing any checkpoints...


I've since debugged my script, and my initial tests show that checkpointing a 
job with BLCR (due to subordinate queueing) and restarting it on the same host 
later works perfectly. So far so good.

I'm now also trying to migrate a job from one host to another, but I'm bumping 
into an issue that I don't immediately see a solution for:

When restarting a job, cr_restart will try to restart the job in exactly the 
same context as it was before. This includes files we're writing to, reading 
from, etc. Unfortunately, that also includes the job script which we're 
actually running, which is <execd_spooldir>/<hostname>/<jobid> or some such. 
When a job is migrated to another host, cr_restart therefore tries to open the 
job script in the old host's spooldir, where it no longer exists (the files 
have been moved to the new host's spooldir).

cr_restart has an option '--relocate' which would allow me to fix this issue, 
but to use it I would need to know the hostname of the host where the 
checkpoint was created. As far as I can see, SGE doesn't store that 
information, but I might be missing something...?
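
To make the idea concrete, here is a rough sketch of one possible workaround, 
not something SGE provides: have the checkpoint step write the current hostname 
to a small file next to the checkpoint, and have the restart step read it back 
to build the '--relocate OLDPATH=NEWPATH' argument. The sidecar file name 
"ckpt_host", both function names, and the assumption that the per-host spool 
directory is simply <spool_root>/<hostname> are all hypothetical; the real 
spool layout and hostname form (short vs. fully qualified) would need checking.

```python
import socket
from pathlib import Path

# Hypothetical sidecar file name; SGE itself does not create this file.
HOST_MARKER = "ckpt_host"


def record_checkpoint_host(ckpt_dir, hostname=None):
    """Call from the checkpoint command: remember which host made the checkpoint."""
    host = hostname or socket.gethostname()
    marker = Path(ckpt_dir) / HOST_MARKER
    marker.write_text(host + "\n")
    return marker


def build_restart_command(ckpt_dir, context_file, spool_root, new_host=None):
    """Call from the restart command: build the cr_restart argument list.

    Assumes the job script lives under <spool_root>/<hostname>/..., so that
    mapping the old host's directory prefix to the new one is sufficient.
    """
    old_host = (Path(ckpt_dir) / HOST_MARKER).read_text().strip()
    new_host = new_host or socket.gethostname()

    cmd = ["cr_restart"]
    if old_host != new_host:
        old_prefix = Path(spool_root) / old_host
        new_prefix = Path(spool_root) / new_host
        # cr_restart rewrites paths starting with OLDPATH to start with NEWPATH.
        cmd += ["--relocate", f"{old_prefix}={new_prefix}"]
    cmd.append(str(context_file))
    return cmd
```

The returned list could be passed to subprocess.run() by the restart script. 
When the job restarts on the host that created the checkpoint, no '--relocate' 
argument is added, so the same-host case behaves as before.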

Thanks,

