Re: JobManager/JobPoller issues

Scott Gray Mon, 04 Feb 2019 15:24:28 -0800

Hi Taher,

I say that it isn't a drop-in replacement solely because it is unbounded
whereas the current implementation appears to depend on the queue being
bounded by the number set in the serviceengine.xml thread-pool.jobs
attribute.

The main concern I have about an unbounded queue is the potential for
instability when you have tens or hundreds of thousands of jobs pending.
I'm not sure about the current implementation but I know the previous
implementation had issues if the poll held the lock for too long while
queuing up large numbers of jobs.

Although with all of that said, after a quick second look it appears that
the current implementation doesn't try to poll for more jobs than the
configured limit (minus already queued jobs) so we might be fine with an
unbounded queue implementation.  We'd just need to alter the call to
JobManager.poll(int limit) to not pass in
executor.getQueue().remainingCapacity() and instead pass in something like
(threadPool.getJobs() - executor.getQueue().size())
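
A rough sketch of why that change would be needed, using only java.util.concurrent. The class and method names (PollLimitSketch, pollLimit) and the String "jobs" are made up for illustration and are not the actual JobPoller code; the clamp to zero is also my assumption. PriorityBlockingQueue.remainingCapacity() always returns Integer.MAX_VALUE, so it can't serve as the poll limit.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical helper, not part of OFBiz.
class PollLimitSketch {

    // configuredJobs stands in for the thread-pool "jobs" attribute.
    // Clamped at zero (an assumption) in case the queue holds more
    // entries than the configured limit.
    static int pollLimit(int configuredJobs, BlockingQueue<?> queue) {
        return Math.max(0, configuredJobs - queue.size());
    }

    public static void main(String[] args) {
        int configuredJobs = 5000;
        BlockingQueue<String> bounded = new LinkedBlockingQueue<>(configuredJobs);
        BlockingQueue<String> unbounded = new PriorityBlockingQueue<>();
        for (int i = 0; i < 1200; i++) {
            bounded.add("job-" + i);
            unbounded.add("job-" + i);
        }
        // Bounded queue: remainingCapacity() reflects the configured limit.
        System.out.println(bounded.remainingCapacity());
        // Unbounded queue: remainingCapacity() is always Integer.MAX_VALUE.
        System.out.println(unbounded.remainingCapacity());
        // Computing the limit from size() works for either queue type.
        System.out.println(pollLimit(configuredJobs, unbounded));
    }
}
```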

I'll keep pondering other options but a PriorityBlockingQueue might be a
good first step, initially to push PurgeJobs to the back of the queue and
perhaps later ServiceJobs/PersistedServiceJobs can be given a priority via
the LocalDispatcher API.
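
To make the idea concrete, here is a minimal sketch of a priority-ordered queue that keeps purge work at the back. QueuedJob, its priority field and the job names are hypothetical stand-ins, not OFBiz classes. PriorityBlockingQueue gives no ordering guarantee between equal-priority elements, so the sketch adds a sequence number as a tie-breaker to keep FIFO order within a priority level.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the OFBiz implementation.
class PriorityQueueSketch {
    static final AtomicLong SEQUENCE = new AtomicLong();

    // Stand-in for a queued job; higher priority is taken first.
    static final class QueuedJob {
        final String name;
        final int priority;
        final long sequence = SEQUENCE.getAndIncrement();

        QueuedJob(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    // Highest priority first, then insertion order within a priority.
    static final Comparator<QueuedJob> ORDER =
            Comparator.comparingInt((QueuedJob j) -> j.priority).reversed()
                    .thenComparingLong(j -> j.sequence);

    static PriorityBlockingQueue<QueuedJob> newQueue() {
        return new PriorityBlockingQueue<>(11, ORDER);
    }

    // Removes every queued job and returns the names in removal order.
    static List<String> drainNames(PriorityBlockingQueue<QueuedJob> queue) {
        List<String> names = new ArrayList<>();
        QueuedJob job;
        while ((job = queue.poll()) != null) {
            names.add(job.name);
        }
        return names;
    }

    public static void main(String[] args) {
        PriorityBlockingQueue<QueuedJob> queue = newQueue();
        queue.add(new QueuedJob("PurgeJob-1", 0));
        queue.add(new QueuedJob("dataImport", 10));
        queue.add(new QueuedJob("PurgeJob-2", 0));
        queue.add(new QueuedJob("sendEmail", 5));
        // Purge jobs come out last even though one was queued first.
        System.out.println(drainNames(queue));
    }
}
```

In a real executor the queue holds Runnables, so the comparator would have to work on whatever type the JobPoller actually enqueues.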

In regards to OFBIZ-10592, I'd be very surprised if the JobManager itself
was the cause of out of memory errors on a 20GB heap.  It sounds to me like
autoDeleteAutoSaveShoppingList was written expecting a low number of
records to process and it started hitting transaction timeouts when the
record count got too large.  They probably ignored (or weren't monitoring)
those failures, and the load on that specific service continued to grow
with each transaction timeout rollback until now they're finally hitting
OOM errors every time it tries to run.

Regards
Scott

On Mon, 4 Feb 2019 at 09:07, Taher Alkhateeb <[email protected]>
wrote:

> Hi Scott,
>
> It seems we have some issues currently with our job scheduler [1]
> which seems to be some sort of memory leak. We are also experiencing
> some performance issues and other anomalies. It seems like a good time
> to perhaps revisit the whole thing.
>
> Are you suggesting to replace LinkedBlockingQueue with
> PriorityBlockingQueue? If so I think it might actually be a better
> option. I think being unbounded _might_ actually resolve some of the
> pain points we're facing. I didn't get why it's not a drop-in
> replacement though. It matches the signature of the call in the
> executor service unless i'm missing something somewhere?
>
> [1] https://issues.apache.org/jira/browse/OFBIZ-10592
>
> On Wed, Jan 30, 2019 at 10:59 PM Scott Gray
> <[email protected]> wrote:
> >
> > Hi folks,
> >
> > Just jotting down some issues with the JobManager I've noticed over the
> > last few days:
> > 1. min-threads in serviceengine.xml is never exceeded unless the job count
> > in the queue exceeds 5000 (or whatever is configured).  Is this not obvious
> > to anyone else?  I don't think this was the behavior prior to a refactoring
> > a few years ago.
> > 2. The advice on the number of threads to use doesn't seem good to me, it
> > assumes your jobs are CPU bound when in my experience they are more likely
> > to be I/O bound while making db or external API calls, sending emails etc.
> > With the default setup, it only takes two long running jobs to effectively
> > block the processing of any others until the queue hits 5000 and the other
> > threads are finally opened up.  If you're not quickly maxing out the queue
> > then any other jobs are stuck until the slow jobs finally complete.
> > 3. Purging old jobs doesn't seem to be well implemented to me, from what
> > I've seen the system is only capable of clearing a few hundred per minute
> > and if you've filled the queue with them then regular jobs have to queue
> > behind them and can take many minutes to finally be executed.
> >
> > I'm wondering if anyone has experimented with reducing the queue size?
> > I'm considering reducing it to say 100 jobs per thread (along with
> > increasing the thread count).  In theory it would reduce the time real jobs
> > have to sit behind PurgeJobs and would also open up additional threads for
> > use earlier.
> >
> > Alternatively I've pondered trying a PriorityBlockingQueue for the job
> > queue (unfortunately the implementation is unbounded though so it isn't a
> > drop-in replacement) so that PurgeJobs always sit at the back of the
> > queue.  It might also allow prioritizing certain "user facing" jobs (such
> > as asynchronous data imports) over lower priority less time critical jobs.
> > Maybe another option (or in conjunction) is some sort of "swim-lane"
> > queue/executor that allocates jobs to threads based on prior execution
> > speed so that slow running jobs can never use up all threads and block
> > faster jobs.
> >
> > Any thoughts/experiences you have to share would be appreciated.
> >
> > Thanks
> > Scott
>
