In the message dated: Thu, 25 Feb 2016 13:38:28 -0800,
The pithy ruminations from Skylar Thompson on
<Re: [gridengine users] Fwd: dispatching sge task from an sge task - is that a reasonable practice?> were:
=> We have pipelines that are driven by a qsub at the end of a batch script.

We have several pipelines like that. It's not uncommon to have the initial
job submit 5~20 other jobs. I don't think it's particularly clean or
'elegant', but it isn't a fundamentally unreasonable practice.

=> Error tracking is an issue but sometimes it's easier to do that than to

Yes.

=> engineer a raft of job dependencies. As you note, concurrency can be an
=> issue, but there are a number of ways to deal with that:
=>
=> * Lock file in a POSIX-compliant filesystem
=> * Semaphore in a network-accessible database
=>
=> You can prevent dead jobs from stalling other jobs by tying the
=> lock/semaphore back to a job and ensuring that it's still running.

So, the lockfile contains the jobID, or something similar?

Some jobs use a file as a flag (i.e., checking its existence or content),
but we've largely avoided POSIX file locking (the mixture of NFS, GPFS, &
CIFS here should work... but it gets complicated quickly).

Our better 'chained' jobs use the SGE "hold" feature, some are launched as
array jobs, and some (the really ugly ones) loop within a shell script,
checking for files that indicate that a prerequisite job finished, checking
for errors in the prereq, or looping over 'qstat' to check whether a
specific jobid has completed.

=>
=> On Thu, Feb 25, 2016 at 11:16:49PM +0200, Ben Daniel Pere wrote:
=> > Where I work, we have jobs that submit jobs that submit jobs.. this could
=> > potentially cause a deadlock but we're somehow (probably luck) manage to
=> > live with it.. I'm wondering if that's a reasonable practice and if not if
=> > you can suggest a better way to do what we do..
=> >
=> > Example:
=> >
=> > we have these 3 tasks:
=> >
=> > - "analyze.day" job analyzed a day of data and returns some output
=> > - "analyze.month" job sends "analyze.day" jobs for a whole month and
=> >   outputs summary
=> > - "analyze.year" job sends "analyze.month" jobs for a whole year and
=> >   outputs summary
=> >
=> > usually people run analyze.day everyday on previous day but sometimes they
=> > test their new algorithm on a whole year so they dispatch analyze.year
=> > which dispatched analyze.month which dispatched analyze.day..
=> > We created a "dispatching" queue which is the only queue we allow
=> > submitting jobs from but since both analyze.year and analyze.month need to
=> > run there (both dispatch tasks) we could end up with a dead lock
=> > (theoretically, lots of analyze.year running together taking all
=> > dispatching queue slots and not leaving room for analyze.month tasks which
=> > they will forever wait for), also besides dispatching they also do some

Hmmm... that's a pretty specific case. Does this happen often enough to
really require a technical solution (another queue, a JSV that checks if
there are available slots before allowing an analyze.year job into the
queue, etc), or would it be better to 'solve' this via user education &
documentation?

=> > logic so it's a strange animal, this "dispatching" queue..
=> >
=> > What's the "correct" practice here?
=>

Mark
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
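
For illustration only, here is a rough sketch of how the year/month/day example could be laid out with the "hold" feature and array jobs, so that every job is submitted up front from one place and no job has to sit in a dispatching queue waiting for its children. It is not anyone's actual pipeline: the script names (analyze_day.sh, summarize_month.sh, summarize_year.sh) and the job-naming scheme are hypothetical, and it assumes the summary steps can be split out of the dispatching jobs. It only builds the qsub command lines (using the standard -N, -t and -hold_jid options) and prints them rather than running them.

```python
import calendar


def build_year_pipeline(year,
                        day_script="analyze_day.sh",        # hypothetical script
                        month_script="summarize_month.sh",  # hypothetical script
                        year_script="summarize_year.sh"):   # hypothetical script
    """Return qsub argument lists for one year of analysis.

    Per month: one array job with a task per day, then a summary job held
    until that array job finishes. A final summary job is held on all
    twelve monthly summaries.
    """
    commands = []
    month_names = []
    for month in range(1, 13):
        days = calendar.monthrange(year, month)[1]
        day_name = f"day_{year}_{month:02d}"
        month_name = f"month_{year}_{month:02d}"
        # Array job: the script would read $SGE_TASK_ID as the day of month.
        commands.append(["qsub", "-N", day_name, "-t", f"1-{days}",
                         day_script, str(year), str(month)])
        # Monthly summary stays on hold until the whole day array is done.
        commands.append(["qsub", "-N", month_name, "-hold_jid", day_name,
                         month_script, str(year), str(month)])
        month_names.append(month_name)
    # Yearly summary waits for every monthly summary (comma-separated list).
    commands.append(["qsub", "-N", f"year_{year}",
                     "-hold_jid", ",".join(month_names),
                     year_script, str(year)])
    return commands


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for cmd in build_year_pipeline(2016):
        print(" ".join(cmd))
```

Holding on job names only works cleanly if the names are unique among the submitting user's jobs; capturing the numeric job IDs (e.g. from "qsub -terse") and holding on those would be the more robust variant.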
