On 8 September 2012 00:09, Reuti <re...@staff.uni-marburg.de> wrote:
>
> On 07.09.2012 at 23:12, Lars van der bijl wrote:
>
>> On 7 September 2012 19:41, Reuti <re...@staff.uni-marburg.de> wrote:
>>> On 07.09.2012 at 18:39, Lars van der bijl wrote:
>>>
>>>> On 7 September 2012 17:48, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>> On 07.09.2012 at 17:45, Lars van der bijl wrote:
>>>>>
>>>>>> On 7 September 2012 17:23, Reuti <re...@staff.uni-marburg.de> wrote:
>>>>>>>> Would it be possible to change the execd to put any job that does
>>>>>>>> not exit with 0 into an error state, regardless of it being a
>>>>>>>> kill -9?
>>>>>>>
>>>>>>> You can rerun the job automatically if you exit the epilog with 99.
>>>>>>
>>>>>> Yes, but with 137 or 139 I can't. And as the task hasn't successfully
>>>>>> finished, I don't want it to start its dependencies; I'd rather it
>>>>>> just go to an error state.
>>>>>
>>>>> You observe that a job being rescheduled by exit 99 will trigger its
>>>>> successors by -hold_jid to start?
>>>>>
>>>> No. When I'm able to raise a 99 exit status, it will not trigger its
>>>> dependencies. A task killed with 137 or 139, however, does, and I'd
>>>> rather have them error out with 100 than be removed from the queue
>>>> altogether.
>>>>
>>>> I know that the grid uses 137 when you request a qdel, and this makes
>>>> it kind of hard to stop a task if anything else would be put into a
>>>> 100 error state.
>>>
>>> No, the chain of commands is the other way round. The `qdel` will send
>>> SIGKILL to the job and remove it from the list of jobs in the system
>>> (whatever you do or set in the epilog doesn't matter, as the job is to
>>> be removed by the `qdel`).
>>>
>>> You can for example:
>>>
>>> - Submit all jobs with a user hold on the successor(s); this user hold
>>> can be removed in the epilog of the predecessor if it ran successfully.
>>> The name/jobid of the successor to be released could be put in a job
>>> context, which you have to read in the epilog and act on accordingly.
>>
>> I could create this with my database layer; however, our system relies
>> very heavily on batching: job1 -> job2 with the same batch range but
>> with different batch sizes, for example 1-100:25 for job1 and 1-100:1
>> for job2. How would I be able to find out what the other range is, and
>> how would I be able to un-hold that specific range?
>
> Why do you want to release only certain array tasks? Usually a plain
> `qhold`/`qrls`, like `qalter`, will affect the complete array job, i.e.
> all tasks. If for example task 26 of the first job fails, you only want
> to block task 26 of job 2 and let all the others run?

Yes, exactly. But knowing which tasks to unblock would be tricky, unless
there is information available in the epilog about which tasks should be
unblocked in job2.

> Nevertheless, the above commands allow a task range to be given, or a
> single task index:
>
> $ qrls 1234 -t 1-10
> $ qrls 1234.42
>
> will release only tasks 1-10; the others stay on hold.
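Combining the two suggestions, a rough and untested sketch of such an
epilog for the batched case could look like the one below. The context key
"successor" is invented for the example (set once both jobs exist, e.g.
with `qalter -ac successor=<job2_id>`), and whether the exit status can be
read from $SGE_JOB_SPOOL_DIR/usage at epilog time may depend on the
installation:

#!/bin/sh
# Epilog sketch: after a task of the coarse array job (-t 1-100:25)
# finishes cleanly, release the matching slice of the fine-grained
# successor (-t 1-100:1) that was submitted with a user hold (qsub -h).

# Only release on a clean exit; where the exit status is readable from
# inside an epilog is an assumption here.
st=$(sed -n 's/^exit_status=//p' "$SGE_JOB_SPOOL_DIR/usage" 2>/dev/null)
[ "${st:-1}" -eq 0 ] || exit 0

# Successor job id, stored in the job context under the invented key.
succ=$(qstat -j "$JOB_ID" | sed -n 's/^context:.*successor=\([0-9]*\).*/\1/p')
[ -n "$succ" ] || exit 0

# Map this task to the successor's slice: task 26 with step size 25
# covers tasks 26-50 of the successor.
first=$SGE_TASK_ID
last=$((SGE_TASK_ID + SGE_TASK_STEPSIZE - 1))
[ "$last" -le "$SGE_TASK_LAST" ] || last=$SGE_TASK_LAST

qrls "$succ" -t "${first}-${last}"

This keeps the range arithmetic in the epilog itself, so the database
layer would only need to record the successor's job id.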
>>> - Create a special queue for some kind of `enabler' jobs which run
>>> forever (looping, e.g., once a minute until they quit); the original
>>> job will create/touch a special file for which the `enabler' is
>>> waiting. If the existence of the relevant file is detected, the
>>> `enabler' can release a hold on a certain job or even just submit the
>>> successor job.
>>>
>>> - Creating a workflow can be done with the tool GEL
>>> (http://wildfire.bii.a-star.edu.sg/, reference:
>>> http://wildfire.bii.a-star.edu.sg/docs/gel_ref.pdf), where you can
>>> check for files. But the jobs will be submitted during the workflow
>>> and not all in advance. Maybe it is useful anyway.
>>>
>>> -- Reuti
>>
>> It would still be nice to know if it were possible to implement the
>> "dormant" task approach. The company I work for would be willing to
>> pay for such development, depending on its feasibility.
>
> I'm still not sure what you mean by a "dormant" state, as the error
> state is not sufficient. Similarly, you can use `qhold 1234.42` and
> `qmod -rj 1234.42` to put task 42 back into the waiting state.
>
> In which state should a "dormant" task be?

If a task errors, that's one thing: our wrappers catch that, check
whether we hit the retry limit, and exit with 100. But there are many
cases where it errors with 137 or 139 and gets removed from the queue, or
where a task doesn't error but the host application spits out corrupt
data.

Instead of removing a task, I'd want to be able to run it again: just
have it put into a non-active or "dormant" state, so that I could run it
again without having to submit a new set of tasks. We very rarely run a
single task; they always have dependencies and always have batching. So
being able to run a subset of tasks again without having to do a
re-submission would make a huge difference.

>
> -- Reuti
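The wrapper behaviour described above could look roughly like this
(untested; the payload path and retry limit are placeholders). The point
is that exiting with 100 leaves the failed task in the queue, in error
state, instead of removing it:

#!/bin/sh
# Job-script wrapper sketch: retry the payload a few times, then exit 100
# so the task ends up in error state rather than leaving the queue.

MAX_RETRIES=3                  # invented limit
try=0
while [ "$try" -lt "$MAX_RETRIES" ]; do
    /path/to/payload "$@"      # placeholder for the real work
    rc=$?
    [ "$rc" -eq 0 ] && exit 0  # clean finish, successors may start
    # A payload killed by SIGKILL (137) or crashing with SIGSEGV (139)
    # lands here too, as long as the wrapper process itself survives.
    try=$((try + 1))
done
exit 100                       # retry limit hit: task kept, in error state

Once the cause is fixed, `qmod -cj 1234.42` should clear the error state
of just that task so it runs again without a re-submission, which comes
close to the "dormant" behaviour asked for (the corrupt-data-with-exit-0
case aside).

And Reuti's `enabler' idea could be as small as this (again untested; the
flag-file path and job id are invented):

#!/bin/sh
# Enabler-job sketch for a dedicated queue: wait for a flag file that the
# real job touches on success, then release the held successor and quit.

FLAG=/shared/flags/job1.task26.done   # invented path, touched by job1
SUCC=1235                             # held successor job id (invented)

while [ ! -e "$FLAG" ]; do
    sleep 60        # poll once a minute, as suggested
done

qrls "$SUCC"        # or instead: qsub successor.sh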