Hey everyone,

We have been using the grid for VFX for a few years, and our job dependencies have grown a lot. A "job" is a collection of tasks. All our tasks have "batches", so we regularly run a job of 50 tasks with 2000+ batches.
Very often a batch dies for reasons such as a memory limit, a segfault, or a kill command; exit codes 139 and 137 are also very common. This has the nasty side effect of the task being removed from the queue completely, and raising a 100 in the epilog doesn't help. Also, rescheduling a task often doesn't kill the task on the original host, so the first host ends up corrupting the second host's output.

My question is: how difficult would it be to get a task not to be removed from the queue, but instead placed in a "dormant" state so that it can be reactivated for another run? Would it be possible to change the execd to put any job that does not exit with 0 into an error state, regardless of whether it was killed with a kill -9?

Greetings,
Lars
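For reference, the 139 and 137 mentioned above follow the usual shell convention of 128 plus the signal number, so 139 is SIGSEGV (11) and 137 is SIGKILL (9) on Linux. Below is a minimal sketch of decoding such a status in a wrapper around a batch command. The names `decode_exit_status` and `run_batch` are hypothetical and only for illustration; nothing here is part of Grid Engine itself.

```python
import signal
import subprocess
import sys


def decode_exit_status(status):
    """Classify a shell-style exit status (0-255) as ok, error, or signal."""
    if status == 0:
        return ("ok", "exited cleanly")
    if status > 128:
        # Shell convention: 128 + N means the process died from signal N.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return ("signal", name)
    return ("error", f"exit code {status}")


def run_batch(cmd):
    """Run a batch command (hypothetical wrapper) and classify how it ended."""
    rc = subprocess.run(cmd).returncode
    # subprocess reports death by signal N as -N; map it to the 128 + N form.
    status = 128 - rc if rc < 0 else rc
    return decode_exit_status(status)


if __name__ == "__main__":
    for code in (0, 1, 137, 139):
        print(code, decode_exit_status(code))
    # Pass a real batch command on the command line to classify its exit.
    if len(sys.argv) > 1:
        print(run_batch(sys.argv[1:]))
```

A wrapper like this only sees the exit of the child it started. If the wrapper itself receives the kill -9, it cannot report anything, which is why the question about handling this in the execd still stands.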