On Tue, 12 Aug 2014 09:51:21 +0000 Wouter Verhelst <[email protected]> wrote:
> Hi,
>
> I wrote:
> > However, I've since noticed that there was actually an error in my
> > script, which caused it to fail. This would probably explain why it
> > wasn't doing any checkpoints...
>
> I've since debugged my script, and my initial tests show that
> checkpointing a job with BLCR (due to subordinate queueing) and
> restarting it on the same host later works perfectly. So far so good.
>
> I'm now also trying to migrate a job from one host to another, but
> I'm bumping into an issue that I don't immediately see a solution for:
>
> When restarting a job, cr_restart will try to restart the job in
> exactly the same context as it was before. This includes files we're
> writing to, reading from, etc. Unfortunately, that also includes the
> job script which we're actually running, which is
> <execd_spooldir>/<hostname>/<jobid> or some such. When a job is
> migrated to another host, the result is then that cr_restart tries to
> open a file in the old host's spooldir, which no longer exists (the
> files have been moved to the new host's spooldir).

We worked around this here in the following manner.

1) execd_spooldir is local to each host rather than on a shared file
   system.
2) Before we start execd for the first time on a host, we create a
   directory called localhost in execd_spooldir on that host and then
   add a symlink named after the host that points to localhost.

When BLCR saves a file it uses the dereferenced file name (i.e. it sees
the files under <execd_spooldir>/localhost/<jobid>), so the recorded
path is the same on every host and everything is fine.

We also set $TMPDIR to point to a directory created by the prolog
without the queue name embedded in it, for similar reasons (IIRC some
later versions of Grid Engine create a TMPDIR without the queue name
embedded anyway).

There are some downsides to a host-local execd_spooldir, but we needed
it for scaling reasons in any case.

William
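PS: for anyone wanting to script step 2, here is a minimal sketch in Python of the layout described above. It is not the script we actually run; the function name prepare_spooldir and the use of socket.gethostname() are assumptions for illustration, and the host name must match whatever name execd uses for its spool subdirectory on your installation (short name or FQDN). Run something like it once per host, before execd is started for the first time.

```python
import os
import socket
import tempfile


def prepare_spooldir(execd_spooldir, hostname=None):
    """Create <execd_spooldir>/localhost and a symlink <hostname> -> localhost.

    Hypothetical helper: the name and arguments are illustrative only.
    Returns the path of the per-host entry inside execd_spooldir.
    """
    # Assumption: execd spools under the name returned by gethostname().
    hostname = hostname or socket.gethostname()
    real_dir = os.path.join(execd_spooldir, "localhost")
    link = os.path.join(execd_spooldir, hostname)

    os.makedirs(real_dir, exist_ok=True)
    # Skip if the entry already exists, or if the host really is "localhost".
    if hostname != "localhost" and not os.path.lexists(link):
        # Relative target, so the link resolves identically on every host.
        os.symlink("localhost", link)
    return link


if __name__ == "__main__":
    # Demonstration in a scratch directory, not a real spool directory.
    with tempfile.TemporaryDirectory() as spool:
        entry = prepare_spooldir(spool, "node01")
        # The dereferenced job path contains "localhost", not "node01".
        print(os.path.realpath(os.path.join(entry, "42")))
```

Because the dereferenced path no longer contains the host name, a job checkpointed on one host and restarted on another looks for its job script under the same path on both.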
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
