Job prioritization is committed in r1857071, thanks to everyone who provided their thoughts.
Regards,
Scott

On Sat, 16 Mar 2019, 22:30 Scott Gray, <scott.g...@hotwaxsystems.com> wrote:

> Patch available at https://issues.apache.org/jira/browse/OFBIZ-10865
>
> Reviews welcome, I probably won't have time to commit it for a few weeks
> so no rush.
>
> By the way, I was amazed to notice that polling is limited to 100 jobs
> per poll with a 30-second poll time, which seems extremely conservative.
> The jobs would have to be very slow for the executor not to sit idle most
> of the time. If no one objects, I'd like to increase this to 2000 jobs
> with a 10-second poll time.
>
> Thanks
> Scott
>
> On Tue, 26 Feb 2019 at 09:13, Scott Gray <scott.g...@hotwaxsystems.com> wrote:
>
>> Hi Jacques,
>>
>> I'm working on implementing the priority queue approach at the moment
>> for a client. All going well, it will be in production in a couple of
>> weeks and I'll report back then with a patch.
>>
>> Regards
>> Scott
>>
>> On Tue, 26 Feb 2019 at 03:11, Jacques Le Roux <jacques.le.r...@les7arts.com> wrote:
>>
>>> Hi,
>>>
>>> I put this comment there with OFBIZ-10002, trying to document why we
>>> have 5 as the hardcoded value of the max-threads attribute in the
>>> thread-pool element (serviceengine.xml). At that time Scott had
>>> already mentioned[1]:
>>>
>>> "Honestly I think the topic is generic enough that OFBiz doesn't need
>>> to provide any information at all. Thread pool sizing is not exclusive
>>> to OFBiz and it would be strange for anyone to modify the numbers
>>> without first researching sources that provide far more detail than a
>>> few sentences in our config files will ever cover."
>>>
>>> I agree with Scott and Jacopo that jobs are more likely I/O rather
>>> than CPU bound. So I agree that we should take that into account,
>>> change the current algorithm and remove this somewhat misleading
>>> comment. Scott's suggestion in his 2nd email sounds good to me. So, if
>>> I understood well, we could use an unbounded but ultimately limited
>>> queue, like it was before.
>>>
>>> Although with all of that said, after a quick second look it appears
>>> that the current implementation doesn't try to poll for more jobs than
>>> the configured limit (minus already queued jobs), so we might be fine
>>> with an unbounded queue implementation. We'd just need to alter the
>>> call to JobManager.poll(int limit) to not pass in
>>> executor.getQueue().remainingCapacity() and instead pass in something
>>> like (threadPool.getJobs() - executor.getQueue().size()).
>>>
>>> I'm fine with that as it would continue to prevent hitting physical
>>> limitations and can still be tweaked by users as it is now. Note
>>> though that it seems hard to tweak, as we have already received
>>> several "complaints" about it.
>>>
>>> Now, one of the advantages of a PriorityBlockingQueue is priority. And
>>> to take advantage of that we can't rely on "natural ordering"; we need
>>> to implement Comparable (which does not seem easy). Nicolas provided
>>> some leads below and this should be discussed. The ideal would be to
>>> have that parametrised, of course.
>>>
>>> My 2 cts
>>>
>>> [1] https://markmail.org/message/ixzluzd44rgloa2j
>>>
>>> Jacques
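A minimal sketch of what the Comparable idea Jacques mentions could look like: a wrapper that orders jobs by an explicit priority, falling back to insertion order so equal-priority jobs stay FIFO. The PrioritizedJob name and its fields are illustrative only, not OFBiz API:

    // Hypothetical wrapper; with this, a PriorityBlockingQueue's "natural
    // ordering" expresses job priority rather than nothing useful.
    class PrioritizedJob implements Comparable<PrioritizedJob>, Runnable {
        private final Runnable job;
        private final int priority;  // higher value runs first
        private final long sequence; // tie-breaker: keeps FIFO order within a priority

        PrioritizedJob(Runnable job, int priority, long sequence) {
            this.job = job;
            this.priority = priority;
            this.sequence = sequence;
        }

        @Override
        public int compareTo(PrioritizedJob other) {
            int byPriority = Integer.compare(other.priority, this.priority); // descending
            return byPriority != 0 ? byPriority : Long.compare(this.sequence, other.sequence);
        }

        @Override
        public void run() {
            job.run();
        }
    }

Queueing a PurgeJob with the lowest priority value would then automatically sort it behind everything else, which is the behaviour Scott asks for further down the thread.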
>>> On 06/02/2019 at 14:24, Nicolas Malin wrote:
>>> > Hello Scott,
>>> >
>>> > On a customer project we use the job manager massively, with an
>>> > average of one hundred thousand jobs per day.
>>> >
>>> > We have different cases: huge long jobs, async persistent jobs, fast
>>> > regular jobs. The main problem we detected has been (as you noted)
>>> > long jobs that stick in the poller's threads; when we restart OFBiz
>>> > (we are on continuous delivery) we had no window to do this without
>>> > crashing some jobs.
>>> >
>>> > To solve this, I tried with Gil to analyze whether we could put some
>>> > weighting on the job definition, to help the job manager decide
>>> > which jobs from the pending queue it can push to the queued queue.
>>> > We changed our approach to create two pools: one for system
>>> > maintenance and huge long jobs, managed by two OFBiz instances, and
>>> > another to manage user-activity jobs, also managed by two instances.
>>> > We also added information on the service definition to indicate the
>>> > preferred pool.
>>> >
>>> > This isn't a big deal and doesn't resolve the stuck pool, but all
>>> > the blocked jobs aren't vital for daily activity.
>>> >
>>> > For crashed jobs, we introduced in trunk a service lock that we set
>>> > before an update, and we wait for a window for the restart.
>>> >
>>> > At this time, for each OOM detected we re-analyse the originating
>>> > job and try to decompose it into persistent async services to help
>>> > spread the load.
>>> >
>>> > If I had more time, I would orient job improvements towards:
>>> >
>>> > * defining an execution-plan rule to link services and pollers
>>> > without touching any service definition
>>> >
>>> > * defining per-instance configuration for the job vacuum, to refine
>>> > it by service volume
>>> >
>>> > This feedback is a little confused, Scott; maybe you'll find
>>> > something interesting in it.
>>> >
>>> > Nicolas
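For what it's worth, the two-pool split Nicolas describes can be pictured as routing jobs to a dedicated executor per pool, so long maintenance jobs can never starve user-facing work. A rough sketch; the pool names, sizes, and routing method are all hypothetical, and OFBiz's real pools are configured through serviceengine.xml rather than built in code like this:

    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    class JobPoolRouter {
        // One executor per pool; the sizes are arbitrary examples.
        private final Map<String, ExecutorService> pools = Map.of(
                "maintenance", Executors.newFixedThreadPool(2), // huge/long system jobs
                "user", Executors.newFixedThreadPool(8));       // fast user-activity jobs

        void submit(String poolName, Runnable job) {
            // Jobs naming an unknown pool fall back to the user pool.
            pools.getOrDefault(poolName, pools.get("user")).execute(job);
        }
    }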
>>> > On 30/01/2019 20:47, Scott Gray wrote:
>>> >> Hi folks,
>>> >>
>>> >> Just jotting down some issues with the JobManager I've noticed over
>>> >> the last few days:
>>> >> 1. min-threads in serviceengine.xml is never exceeded unless the
>>> >> job count in the queue exceeds 5000 (or whatever is configured). Is
>>> >> this not obvious to anyone else? I don't think this was the
>>> >> behavior prior to a refactoring a few years ago.
>>> >> 2. The advice on the number of threads to use doesn't seem good to
>>> >> me: it assumes your jobs are CPU bound, when in my experience they
>>> >> are more likely to be I/O bound while making db or external API
>>> >> calls, sending emails etc. With the default setup, it only takes
>>> >> two long-running jobs to effectively block the processing of any
>>> >> others until the queue hits 5000 and the other threads are finally
>>> >> opened up. If you're not quickly maxing out the queue then any
>>> >> other jobs are stuck until the slow jobs finally complete.
>>> >> 3. Purging old jobs doesn't seem to be well implemented to me. From
>>> >> what I've seen, the system is only capable of clearing a few
>>> >> hundred per minute, and if you've filled the queue with them then
>>> >> regular jobs have to queue behind them and can take many minutes to
>>> >> finally be executed.
>>> >>
>>> >> I'm wondering if anyone has experimented with reducing the queue
>>> >> size? I'm considering reducing it to say 100 jobs per thread (along
>>> >> with increasing the thread count). In theory it would reduce the
>>> >> time real jobs have to sit behind PurgeJobs and would also open up
>>> >> additional threads for use earlier.
>>> >>
>>> >> Alternatively, I've pondered trying a PriorityBlockingQueue for the
>>> >> job queue (unfortunately that implementation is unbounded, so it
>>> >> isn't a drop-in replacement) so that PurgeJobs always sit at the
>>> >> back of the queue. It might also allow prioritizing certain "user
>>> >> facing" jobs (such as asynchronous data imports) over
>>> >> lower-priority, less time-critical jobs. Maybe another option (or
>>> >> in conjunction) is some sort of "swim-lane" queue/executor that
>>> >> allocates jobs to threads based on prior execution speed, so that
>>> >> slow-running jobs can never use up all the threads and block faster
>>> >> jobs.
>>> >>
>>> >> Any thoughts/experiences you have to share would be appreciated.
>>> >>
>>> >> Thanks
>>> >> Scott
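On the unbounded-queue point Scott raises: one common workaround is to subclass PriorityBlockingQueue and cap offer(), so a ThreadPoolExecutor sees the queue as "full" and can grow its pool or reject, roughly mimicking a bounded queue. A sketch, with the caveat that the size check and the insert are not atomic, so the cap is only approximate and a production version would need stricter bounding:

    import java.util.Comparator;
    import java.util.concurrent.PriorityBlockingQueue;

    class BoundedPriorityBlockingQueue<E> extends PriorityBlockingQueue<E> {
        private final int capacity;

        BoundedPriorityBlockingQueue(int capacity, Comparator<? super E> comparator) {
            super(11, comparator); // 11 is PriorityBlockingQueue's default initial capacity
            this.capacity = capacity;
        }

        @Override
        public boolean offer(E e) {
            // Refusing the offer is what lets ThreadPoolExecutor add threads
            // beyond core size or invoke its RejectedExecutionHandler,
            // restoring the behaviour of a bounded work queue.
            return size() < capacity && super.offer(e);
        }
    }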