Hey everyone,

We have been using the grid for VFX for a few years, and our job
dependencies have grown a lot. A "job" is a collection of tasks. All
our tasks have "batches", so we regularly run a job of 50 tasks with
2000+ batches.

Very often a batch dies because of a memory limit, a segfault, or a
kill command; exit statuses 139 (SIGSEGV) and 137 (SIGKILL) are very
common. This has the nasty side effect that the task is removed from
the queue completely, and exiting with 100 from the epilog doesn't help.
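
For reference, here is a minimal Python sketch of how those codes decode.
It assumes the usual shell convention that a status above 128 means
128 plus the number of the signal that killed the process, and Linux
signal numbering. The function name is made up for illustration and is
not part of Grid Engine.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Describe a shell-style exit status (0-255).

    Assumption: statuses above 128 are treated as 128 + signal number.
    A program can also exit with such a value on its own, so this is
    only a best guess at what happened.
    """
    if status == 0:
        return "success"
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


for status in (0, 1, 137, 139):
    print(status, describe_exit_status(status))
```

On Linux, 137 maps to SIGKILL (128 + 9) and 139 to SIGSEGV (128 + 11).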

Also, rescheduling a task often doesn't kill the task on the original
host, causing the first host to corrupt the second host's output.

My question is: how difficult would it be to keep such a task in the
queue in a "dormant" state instead of removing it, so that it can be
re-activated for another run?

Would it be possible to change the execd to put any job that does not
exit with 0 into an error state, even when it was killed with kill -9?

greetings,

Lars
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users