Hi Lars,

On 07.09.2012 at 16:55, Lars van der bijl wrote:

> Hey everyone,
> 
> We have been using the grid for VFX for a few years and our job
> dependencies have grown a lot. A "job" is a collection of tasks. All
> our tasks have "batches" so we regularly run a job of 50 tasks with
> 2000+ batches.
> 
> Very often a batch dies for reasons such as memory limits, seg fault,
> or a kill command; getting a 139 or a 137 is also very common.
> This has the nasty side effect of the task being removed from the
> queue completely and raising a 100 in the epilog won't help.
> 
> Also, rescheduling a task often doesn't kill the task on the original
> host, causing the first host to corrupt the second host's output.

Then the job's integration into the queuing system needs to be checked,
i.e. whether any processes are escaping the job's process tree. Setting:

$ qconf -sconf
#global:
...
execd_params                 ENABLE_ADDGRP_KILL=TRUE

might help.
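
To check whether the parameter is already active, something along these
lines could be used. It is only a sketch: the function name is made up,
and it assumes execd_params sits on a single line of the `qconf -sconf`
output, as in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine).
# Reads `qconf -sconf` output on stdin and succeeds if the execd_params
# line contains ENABLE_ADDGRP_KILL=TRUE (matched case-insensitively).
# Assumption: execd_params is on one line, without continuation lines.
addgrp_kill_enabled() {
    grep -E '^execd_params[[:space:]]' | grep -Eqi 'ENABLE_ADDGRP_KILL=TRUE'
}
```

Usage would be `qconf -sconf | addgrp_kill_enabled && echo enabled`.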


> My question is: how difficult would it be to get a task not to be
> removed from the queue but placed in a "dormant" state, so that it
> can be re-activated for another run?

I'm quite confused by your requirements, as you stated above that the tasks
segfault. How can such a job be put into a dormant state, i.e. be suspended?
Do you mean you want to requeue the job, so that it gets scheduled again?


> Would it be possible to change the execd to put any job that does not
> exit with 0 into an error state, regardless of it being a kill -9?

You can rerun the job automatically if you exit the epilog with 99.
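
A minimal sketch of the decision an epilog could make is below. The
function name is made up, and the policy (requeue on any non-zero status)
is only an assumption for illustration. How the epilog learns the job's
exit status is left open here; it is passed in as an argument.

```shell
#!/bin/sh
# Hypothetical helper for an epilog script (name and policy are assumptions).
# Prints the exit code the epilog should return for a given job exit status:
#   0  -> job finished cleanly, nothing to do
#   99 -> ask Grid Engine to requeue the job
epilog_code_for() {
    case "$1" in
        0) echo 0 ;;   # clean exit
        *) echo 99 ;;  # any failure, e.g. 137 (128+9) or 139 (128+11)
    esac
}
```

The epilog itself would then end with something like
`exit "$(epilog_code_for "$status")"`, where `$status` is a placeholder
for however the job's exit status is obtained in your setup.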

-- Reuti


> greetings,
> 
> Lars
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

