On 8 September 2012 00:09, Reuti <re...@staff.uni-marburg.de> wrote:
>
> On 07.09.2012 at 23:12, Lars van der bijl wrote:
>
>> On 7 September 2012 19:41, Reuti <re...@staff.uni-marburg.de> wrote:
>>> On 07.09.2012 at 18:39, Lars van der bijl wrote:
>>>
>>>> On 7 September 2012 17:48, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>> On 07.09.2012 at 17:45, Lars van der bijl wrote:
>>>>>
>>>>>> On 7 September 2012 17:23, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>>>>> Would it be possible to change the execd to put any job that does
>>>>>>>> not exit with 0 into an error state, regardless of it being a
>>>>>>>> kill -9?
>>>>>>>
>>>>>>> You can rerun the job automatically if you exit the epilog with 99.
>>>>>>
>>>>>> Yes, but with 137 or 139 I can't. And as the task hasn't successfully
>>>>>> finished, I don't want it to start its dependencies; I'd rather it
>>>>>> just go to an error state.
>>>>>
>>>>> You observe that a job being rescheduled by exit 99 will trigger its
>>>>> successors by -hold_jid to start?
>>>>>
>>>> No. When I'm able to raise a 99 exit status, it will not trigger its
>>>> dependencies. A task killed with 137 or 139, however, does, and I'd
>>>> rather have them error out with 100 than be removed from the queue
>>>> altogether.
>>>>
>>>> I know that the grid uses 137 when you request a qdel, and this makes
>>>> it kind of hard to stop a task if anything else would be put into a
>>>> 100 error state.
>>>
>>> No, the chain of commands is the other way round. The `qdel` will send
>>> SIGKILL to the job and remove it from the list of jobs in the system
>>> (whatever you do or set in the epilog doesn't matter, as the job is to
>>> be removed by the `qdel`).
>>>
>>> You can for example:
>>>
>>> - Submit all jobs with a user hold on the successor(s); this user hold
>>> can be removed in the epilog of the predecessor if it ran successfully.
>>> The name/jobid of the successor to be released could be put in a job
>>> context, which you have to read in the epilog and act on accordingly.
>>
>> I could create this with my database layer; however, our system relies
>> very heavily on batching: job1 -> job2 with the same batch range but
>> with different batch sizes, for example 1-100:25 for job1 and 1-100:1
>> for job2. How would I be able to find out what the other range is, and
>> how would I be able to un-hold that specific range?
>
> Why do you want to release only certain array tasks? Usually a plain
> `qhold`/`qrls`, like `qalter`, will affect the complete array job, i.e.
> all tasks. If for example task 26 of the first job fails, you only want
> to block task 26 of job 2 and let all the others run?

Yes, exactly. But knowing which tasks to unblock would be tricky, unless
there is information available in the epilog about which tasks should be
unblocked in job2.

> Nevertheless, the above commands allow a task range to be given, or a
> single task index:
>
> $ qrls 1234 -t 1-10
> $ qrls 1234.42
>
> will release only tasks 1-10; the others stay on hold.
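Combining the two suggestions, a rough and untested sketch of such an
epilog for the batched case could look like the one below. The context key
"successor" is invented for the example (set once both jobs exist, e.g.
with `qalter -ac successor=<job2_id>`), and whether the exit status can be
read from $SGE_JOB_SPOOL_DIR/usage at epilog time may depend on the
installation:

#!/bin/sh
# Epilog sketch: after a task of the coarse array job (-t 1-100:25)
# finishes cleanly, release the matching slice of the fine-grained
# successor (-t 1-100:1) that was submitted with a user hold (qsub -h).

# Only release on a clean exit; where the exit status is readable from
# inside an epilog is an assumption here.
st=$(sed -n 's/^exit_status=//p' "$SGE_JOB_SPOOL_DIR/usage" 2>/dev/null)
[ "${st:-1}" -eq 0 ] || exit 0

# Successor job id, stored in the job context under the invented key.
succ=$(qstat -j "$JOB_ID" | sed -n 's/^context:.*successor=\([0-9]*\).*/\1/p')
[ -n "$succ" ] || exit 0

# Map this task to the successor's slice: task 26 with step size 25
# covers tasks 26-50 of the successor.
first=$SGE_TASK_ID
last=$((SGE_TASK_ID + SGE_TASK_STEPSIZE - 1))
[ "$last" -le "$SGE_TASK_LAST" ] || last=$SGE_TASK_LAST

qrls "$succ" -t "${first}-${last}"

This keeps the range arithmetic in the epilog itself, so the database
layer would only need to record the successor's job id.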
>>> - Create a special queue for some kind of `enabler' jobs which run
>>> forever (looping, e.g., once a minute until they quit); the original
>>> job will create/touch a special file for which the `enabler' is
>>> waiting. If the existence of the relevant file is detected, the
>>> `enabler' can release a hold on a certain job or even just submit the
>>> successor job.
>>>
>>> - Creating a workflow can be done with the tool GEL
>>> (http://wildfire.bii.a-star.edu.sg/, reference:
>>> http://wildfire.bii.a-star.edu.sg/docs/gel_ref.pdf), where you can
>>> check for files. But the jobs will be submitted during the workflow
>>> and not all in advance. Maybe it is useful anyway.
>>>
>>> -- Reuti
>>
>> It would still be nice to know if it were possible to implement the
>> "dormant" task approach. The company I work for would be willing to
>> pay for such development, depending on its feasibility.
>
> I'm still not sure what you mean by a "dormant" state, as the error
> state is not sufficient. Similarly, you can use `qhold 1234.42` and
> `qmod -rj 1234.42` to put task 42 back into the waiting state.
>
> In which state should a "dormant" task be?

If a task errors, that's one thing: our wrappers catch that, check
whether we hit the retry limit, and exit with 100. But there are many
cases where it errors with 137 or 139 and gets removed from the queue, or
where a task doesn't error but the host application spits out corrupt
data.

Instead of removing a task, I'd want to be able to run it again: just
have it put into a non-active or "dormant" state, so that I could run it
again without having to submit a new set of tasks. We very rarely run a
single task; they always have dependencies and always have batching. So
being able to run a subset of tasks again without having to do a
re-submission would make a huge difference.

>
> -- Reuti
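The wrapper behaviour described above could look roughly like this
(untested; the payload path and retry limit are placeholders). The point
is that exiting with 100 leaves the failed task in the queue, in error
state, instead of removing it:

#!/bin/sh
# Job-script wrapper sketch: retry the payload a few times, then exit 100
# so the task ends up in error state rather than leaving the queue.

MAX_RETRIES=3                  # invented limit
try=0
while [ "$try" -lt "$MAX_RETRIES" ]; do
    /path/to/payload "$@"      # placeholder for the real work
    rc=$?
    [ "$rc" -eq 0 ] && exit 0  # clean finish, successors may start
    # A payload killed by SIGKILL (137) or crashing with SIGSEGV (139)
    # lands here too, as long as the wrapper process itself survives.
    try=$((try + 1))
done
exit 100                       # retry limit hit: task kept, in error state

Once the cause is fixed, `qmod -cj 1234.42` should clear the error state
of just that task so it runs again without a re-submission, which comes
close to the "dormant" behaviour asked for (the corrupt-data-with-exit-0
case aside).

And Reuti's `enabler' idea could be as small as this (again untested; the
flag-file path and job id are invented):

#!/bin/sh
# Enabler-job sketch for a dedicated queue: wait for a flag file that the
# real job touches on success, then release the held successor and quit.

FLAG=/shared/flags/job1.task26.done   # invented path, touched by job1
SUCC=1235                             # held successor job id (invented)

while [ ! -e "$FLAG" ]; do
    sleep 60        # poll once a minute, as suggested
done

qrls "$SUCC"        # or instead: qsub successor.sh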