Re: [gridengine users] task exit status problems

Lars van der bijl Fri, 07 Sep 2012 14:14:41 -0700

On 7 September 2012 19:41, Reuti <re...@staff.uni-marburg.de> wrote:
> Am 07.09.2012 um 18:39 schrieb Lars van der bijl:
>
>> On 7 September 2012 17:48, Reuti <re...@staff.uni-marburg.de> wrote:
>>> Am 07.09.2012 um 17:45 schrieb Lars van der bijl:
>>>
>>>> On 7 September 2012 17:23, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>>> would it be possible to change the execd to put any job that does not
>>>>>> exit with 0 into an error state? regardless of it being a kill -9?
>>>>>
>>>>> You can rerun the job automatically if you exit the epilog with 99.
>>>>
>>>> yes but with 137 or 139 i can't. and as the task hasn't successfully
>>>> finished i don't want it to start its dependencies. i'd rather it
>>>> just go to an error state.
>>>
>>> You observe that a job rescheduled by exit 99 will trigger its 
>>> successors held by -hold_jid to start?
>>>
>> no when i'm able to raise a 99 exit status it will not trigger its
>> dependencies. however a task killed with 137 or 139 does.
>> and I'd rather they error out with 100 than be removed from the
>> queue altogether.
>>
>> i know that the grid uses 137 when you request a qdel. and this makes
>> it kinda hard to stop a task if anything else would be put in a 100
>> error state.
>
> No, the chain of commands is the other way round. The `qdel` will send 
> sigkill to the job and remove it from the list of jobs in the system 
> (whatever you do or set in the epilog doesn't matter, as the job is to be 
> removed by the `qdel`).
>
> You can for example:
>
> - Submit all jobs with a user hold on the successor(s); this user hold 
> can be removed in the epilog of the predecessor if it ran successfully. The 
> name/jobid of the successor to be released could be put in a job context 
> which you have to read in the epilog and act accordingly.


I could build this with my database layer, but our system relies
very heavily on batching: task1 -> task2 share the same batch range
but use different batch sizes, for example 1-100:25 for task1 and
1-100:1 for task2. How would I find out what the other range is, and
how would I un-hold that specific range?
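
To make the range question concrete, here is a minimal sketch of the
mapping I have in mind. It assumes that an array task with ID t and
step s handles items t through t+s-1 (capped at the end of the range),
and that the successor's range is known from somewhere, e.g. stored in
a job context as suggested above. The function names and the
(first, last, step) tuples are hypothetical and not part of Grid Engine.

```python
def covered_range(task_id, first, last, step):
    """Return the inclusive (start, end) items handled by one array task.

    Assumes a range of the form first-last:step, where task IDs are
    first, first+step, first+2*step, ...
    """
    if task_id < first or task_id > last or (task_id - first) % step != 0:
        raise ValueError("task_id %d is not part of %d-%d:%d"
                         % (task_id, first, last, step))
    return task_id, min(task_id + step - 1, last)


def successor_tasks(task_id, pred, succ):
    """List the successor task IDs that fall inside one predecessor task.

    pred and succ are (first, last, step) tuples for the two array jobs.
    """
    start, end = covered_range(task_id, *pred)
    s_first, s_last, s_step = succ
    return [t for t in range(s_first, s_last + 1, s_step)
            if start <= t <= end]


if __name__ == "__main__":
    # task1 is 1-100:25, task2 is 1-100:1; predecessor task 26 finished.
    tasks = successor_tasks(26, (1, 100, 25), (1, 100, 1))
    print(tasks[0], tasks[-1])
```

With those ranges, predecessor task 26 maps to successor tasks 26
through 50. What is still open is how the epilog would release the
hold on only that sub-range of task2.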

>
> - Create a special queue for some kind of `enabler' jobs which run forever 
> (loop e.g. once a minute until they quit), the original job will create/touch 
> a special file for which the `enabler' is waiting. If the existence of the 
> relevant file is detected, the `enabler' can release a hold of a certain job 
> or even just submit the successor job.
>
> - A workflow can be created with the GEL tool from 
> http://wildfire.bii.a-star.edu.sg/ (reference: 
> http://wildfire.bii.a-star.edu.sg/docs/gel_ref.pdf), where you can 
> check for files. But the jobs will be submitted during the workflow and not 
> all in advance. Maybe it is useful anyway.
>
> -- Reuti.
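
For reference, the polling part of the `enabler' idea could be
sketched roughly as below. This is only an outline under assumptions:
the marker path and the on_found callback are hypothetical, and the
actual release or submission of the successor is left to that callback
rather than spelled out as a specific command.

```python
import os
import time


def wait_for_marker(path, on_found, interval=60.0, max_polls=None,
                    sleep=time.sleep):
    """Poll for a marker file and call on_found(path) once it exists.

    Checks once per `interval` seconds. Returns True if the marker was
    seen, or False if max_polls checks passed without it appearing.
    With max_polls=None the loop runs until the marker shows up.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if os.path.exists(path):
            on_found(path)  # e.g. release the hold or submit the successor
            return True
        polls += 1
        sleep(interval)
    return False
```

The original job would touch the marker file only on success, so a
task killed with 137 or 139 never triggers the callback.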

It would still be nice to know whether it would be possible to implement
the "dormant" task approach. The company I work for would be willing
to pay for such development, depending on feasibility.
