On Tue, 12 Aug 2014 09:51:21 +0000
Wouter Verhelst <[email protected]> wrote:

> Hi,
> 
> I wrote:
> > However, I've since noticed that there was actually an error in my
> > script, which caused it to fail. This would probably explain why it
> > wasn't doing any checkpoints...
> 
> I've since debugged my script, and my initial tests show that
> checkpointing a job with BLCR (due to subordinate queueing) and
> restarting it on the same host later works perfectly. So far so good.
> 
> I'm now also trying to migrate a job from one host to another, but
> I'm bumping into an issue that I don't immediately see a solution for:
> 
> When restarting a job, cr_restart will try to restart the job in
> exactly the same context as it was before. This includes files we're
> writing to, reading from, etc. Unfortunately, that also includes the
> job script which we're actually running, which is
> <execd_spooldir>/<hostname>/<jobid> or some such. When a job is
> migrated to another host, the result is then that cr_restart tries to
> open a file in the old host's spooldir, which no longer exists (the
> files have been moved to the new host's spooldir).

We worked around this here as follows:

1) execd_spooldir is local to each host rather than on a shared file
   system.
2) Before starting execd for the first time on a host, we create a
   directory called localhost in execd_spooldir on that host, then add
   a symlink from the hostname to localhost.
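
For illustration only, here is a minimal Python sketch of step 2. It is
not the script we actually use. The function name prepare_spool, the
example hostname node01 and the job id 42 are all made up, and it
assumes nothing has yet created a real directory named after the host
(os.symlink would fail if execd had already done so).

```python
import os
import tempfile


def prepare_spool(spool_root, hostname):
    """Create <spool_root>/localhost and make <spool_root>/<hostname> a symlink to it.

    Hypothetical helper: spool_root stands in for the host-local
    execd_spooldir, hostname for the name execd uses for its subdirectory.
    """
    real_dir = os.path.join(spool_root, "localhost")
    link = os.path.join(spool_root, hostname)
    os.makedirs(real_dir, exist_ok=True)
    if not os.path.islink(link):
        # Relative target, so the link still resolves if spool_root is moved.
        os.symlink("localhost", link)
    return link


if __name__ == "__main__":
    # Demo in a throwaway directory instead of a real spool directory.
    root = tempfile.mkdtemp()
    link = prepare_spool(root, "node01")
    # Resolving a path through the symlink gives the host-independent name.
    print(os.path.realpath(os.path.join(link, "42")))
```

The final print shows the point of the exercise: a path reached through
the per-host symlink resolves to one under localhost, which has the
same name on every host set up this way.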

When BLCR saves a file it uses the dereferenced file name (i.e. it sees
the files under <execd_spooldir>/localhost/<jobid>), so the path is the
same on every host and everything is fine. We also set $TMPDIR to point
to a directory created by the prolog without the queue name embedded in
it, for similar reasons (IIRC some later versions of Grid Engine create
a TMPDIR without the queue name embedded anyway).

There are some downsides to a host-local execd_spooldir, but we needed
it for scaling reasons in any case.

William
 
