Hi,

I put this comment there with OFBIZ-10002, trying to document why we have 5 as the hardcoded value of the /max-threads/ attribute in the /thread-pool/ element (serviceengine.xml). At that point Scott had already mentioned[1]:

   /Honestly I think the topic is generic enough that OFBiz doesn't need to 
provide any information at all. Thread pool sizing is not exclusive to
   OFBiz and it would be strange for anyone to modify the numbers without first 
researching sources that provide far more detail than a few sentences
   in our config files will ever cover./

I agree with Scott and Jacopo that jobs are more likely I/O bound than CPU bound. So I agree that we should take that into account, change the current algorithm and remove this somewhat misleading comment. Scott's suggestion in his 2nd email sounds good to me. If I understood it well, we could use an unbounded but ultimately limited queue, like it was before.

   Although with all of that said, after a quick second look it appears that
   the current implementation doesn't try poll for more jobs than the
   configured limit (minus already queued jobs) so we might be fine with an
   unbounded queue implementation.  We'd just need to alter the call to
   JobManager.poll(int limit) to not pass in
   executor.getQueue().remainingCapacity() and instead pass in something like
   (threadPool.getJobs() - executor.getQueue().size())

I'm fine with that, as it would continue to prevent hitting physical limitations and could still be tweaked by users as it is now. Note though that it seems hard to tweak, as we have already received several "complaints" about it.
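
To make that concrete, here is a minimal sketch of the poll-limit change Scott
describes, using stand-in interfaces so it is self-contained. The names
JobManager.poll, threadPool.getJobs and the executor queue calls are the ones
used in this thread; everything else is an assumption, not current trunk code:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ThreadPoolExecutor;

    class PollLimitSketch {
        // Minimal stand-ins so the sketch compiles on its own.
        interface Job {}
        interface JobManager { List<Job> poll(int limit); }
        interface ThreadPool { int getJobs(); } // the configured job limit from serviceengine.xml

        private final ThreadPoolExecutor executor;
        private final JobManager jobManager;
        private final ThreadPool threadPool;

        PollLimitSketch(ThreadPoolExecutor executor, JobManager jobManager, ThreadPool threadPool) {
            this.executor = executor;
            this.jobManager = jobManager;
            this.threadPool = threadPool;
        }

        List<Job> pollOnce() {
            // Current behaviour (bounded queue): poll only what still fits.
            // int limit = executor.getQueue().remainingCapacity();

            // Suggested behaviour (unbounded queue): cap the poll at the
            // configured job limit minus what is already queued.
            int limit = threadPool.getJobs() - executor.getQueue().size();
            if (limit <= 0) {
                return Collections.emptyList();
            }
            return jobManager.poll(limit);
        }
    }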

Now one of the advantages of a PriorityBlockingQueue is priority. To take advantage of that we can't rely on /natural ordering/ and need to provide an ordering ourselves, either by implementing Comparable or by supplying a Comparator (which does not seem easy). Nicolas provided some leads below and this should be discussed. The best would be to have that parametrised, of course.
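
For the ordering itself, a PriorityBlockingQueue can be built with an explicit
Comparator, so the queued wrapper does not strictly have to implement
Comparable. A rough illustration only, assuming a hypothetical PrioritizedJob
wrapper with a numeric priority (none of these names exist in OFBiz):

    import java.util.Comparator;
    import java.util.concurrent.PriorityBlockingQueue;

    class JobQueueSketch {
        // Hypothetical wrapper around the actual job invocation; the priority
        // could come from the service definition if it were parametrised.
        static class PrioritizedJob implements Runnable {
            final long priority;     // lower value = run sooner; a purge job could get Long.MAX_VALUE
            final Runnable delegate;

            PrioritizedJob(long priority, Runnable delegate) {
                this.priority = priority;
                this.delegate = delegate;
            }

            @Override
            public void run() {
                delegate.run();
            }
        }

        // Entries with the smallest priority value are taken first, so
        // low-priority purge jobs always sit at the back of the queue.
        static final Comparator<Runnable> BY_PRIORITY =
                Comparator.comparingLong(r -> ((PrioritizedJob) r).priority);

        // Work queue for the executor. Note the cast in the comparator only works
        // if everything offered to the queue is a PrioritizedJob, i.e. jobs are
        // handed to ThreadPoolExecutor.execute(), not submit().
        static final PriorityBlockingQueue<Runnable> QUEUE =
                new PriorityBlockingQueue<>(1024, BY_PRIORITY);
    }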

My 2 cts

[1] https://markmail.org/message/ixzluzd44rgloa2j

Jacques

On 06/02/2019 at 14:24, Nicolas Malin wrote:
Hello Scott,

On a customer project we use the job manager heavily, with an average of one
hundred thousand jobs per day.

We have different cases: huge long-running jobs, async persistent jobs, fast regular jobs. The main problem we detected has been (as you noted) the long jobs that block the poller's threads, and when we restart OFBiz (we are on continuous delivery) we had no window to do it without crashing some jobs.

To solve this, I tried with Gil to analyze whether we could put some weighting on the job definition, to help the job manager decide which jobs from the pending queue it can push to the queued queue. We changed our approach and created two pools: one for system maintenance and huge long-running jobs, managed by two OFBiz instances, and another to manage user-activity jobs, also managed by two instances. We also added to the service definition a piece of information to indicate the preferred pool.

This isn't a big deal and does not resolve the stuck pool, but the blocked
jobs aren't vital for daily activity.

For crashed jobs, we introduced in trunk a service lock that we set before an
update, then we wait for a window for the restart.

At this time, for every OOM detected we reanalyse the originating job and try
to decompose it into persistent async services to help spread the load.
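
As an illustration of that decomposition, a minimal sketch of a parent service
that fires each item of work as a persisted async sub-service through
LocalDispatcher.runAsync. The service names and the "orderIds" parameter are
hypothetical; only the runAsync(name, context, persist) overload is existing
dispatcher API:

    import java.util.List;
    import java.util.Map;

    import org.apache.ofbiz.base.util.UtilMisc;
    import org.apache.ofbiz.service.DispatchContext;
    import org.apache.ofbiz.service.GenericServiceException;
    import org.apache.ofbiz.service.LocalDispatcher;
    import org.apache.ofbiz.service.ServiceUtil;

    public class BigJobDecompositionSketch {
        // Hypothetical parent service: rather than processing everything in one
        // long-running job, it schedules one persisted async job per item so the
        // work is spread across poller runs (and across instances).
        public static Map<String, Object> processAllOrders(DispatchContext dctx,
                Map<String, ? extends Object> context) {
            LocalDispatcher dispatcher = dctx.getDispatcher();
            @SuppressWarnings("unchecked")
            List<String> orderIds = (List<String>) context.get("orderIds"); // hypothetical parameter
            try {
                for (String orderId : orderIds) {
                    // persist = true: the job is stored in JobSandbox, survives a
                    // restart and is picked up later by the JobPoller
                    dispatcher.runAsync("processSingleOrder", // hypothetical sub-service
                            UtilMisc.toMap("orderId", orderId), true);
                }
            } catch (GenericServiceException e) {
                return ServiceUtil.returnError(e.getMessage());
            }
            return ServiceUtil.returnSuccess();
        }
    }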

If I had more time, I would orient job improvements towards:

 * Defining an execution plan rule to link services and pollers without
touching any service definition

 * Defining per-instance configuration for the job vacuum, to refine it by
service volume

This feedback is a little confused, Scott; maybe you will find some interesting things in it.

Nicolas

On 30/01/2019 20:47, Scott Gray wrote:
Hi folks,

Just jotting down some issues with the JobManager noticed over the last few
days:
1. min-threads in serviceengine.xml is never exceeded unless the job count
in the queue exceeds 5000 (or whatever is configured); see the sketch after
this list for why.  Is this not obvious to anyone else?  I don't think this
was the behavior prior to a refactoring a few years ago.
2. The advice on the number of threads to use doesn't seem good to me, it
assumes your jobs are CPU bound when in my experience they are more likely
to be I/O bound while making db or external API calls, sending emails etc.
With the default setup, it only takes two long running jobs to effectively
block the processing of any others until the queue hits 5000 and the other
threads are finally opened up.  If you're not quickly maxing out the queue
then any other jobs are stuck until the slow jobs finally complete.
3. Purging old jobs doesn't seem to be well implemented to me, from what
I've seen the system is only capable of clearing a few hundred per minute
and if you've filled the queue with them then regular jobs have to queue
behind them and can take many minutes to finally be executed.
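
Regarding point 1, that is standard java.util.concurrent.ThreadPoolExecutor
behaviour: threads beyond corePoolSize (min-threads) are only created once the
work queue rejects an offer, i.e. once it is full. A small self-contained
sketch, with numbers mirroring the values discussed in this thread rather than
the actual serviceengine.xml defaults:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    class PoolGrowthSketch {
        public static void main(String[] args) {
            ThreadPoolExecutor executor = new ThreadPoolExecutor(
                    1,                                // corePoolSize ~ min-threads
                    5,                                // maximumPoolSize ~ max-threads
                    60, TimeUnit.SECONDS,             // keep-alive for the non-core threads
                    new LinkedBlockingQueue<>(5000)); // queue capacity ~ configured job limit

            // Ten slow tasks: the first one starts the single core thread, the
            // other nine just sit in the queue. No extra thread is created until
            // the queue itself is full (5000 waiting tasks).
            for (int i = 0; i < 10; i++) {
                executor.execute(PoolGrowthSketch::sleepAWhile);
            }
            System.out.println("pool size after 10 submissions: " + executor.getPoolSize()); // prints 1
            executor.shutdown();
        }

        private static void sleepAWhile() {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }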

I'm wondering if anyone has experimented with reducing the queue size?
I'm considering reducing it to say 100 jobs per thread (along with
increasing the thread count).  In theory it would reduce the time real jobs
have to sit behind PurgeJobs and would also open up additional threads for
use earlier.

Alternatively I've pondered trying a PriorityBlockingQueue for the job
queue (unfortunately the implementation is unbounded though so it isn't a
drop-in replacement) so that PurgeJobs always sit at the back of the
queue.  It might also allow prioritizing certain "user facing" jobs (such
as asynchronous data imports) over lower priority less time critical jobs.
Maybe another option (or in conjunction) is some sort of "swim-lane"
queue/executor that allocates jobs to threads based on prior execution
speed so that slow running jobs can never use up all threads and block
faster jobs.

Any thoughts/experiences you have to share would be appreciated.

Thanks
Scott

