Re: (Continuous) crawl performance

Karl Wright Thu, 30 Oct 2014 16:43:42 -0700

Hi Aeham,

The problem is that you are not using MCF 1.7.1, which has a fix for the
stuffer thread query that makes it run quickly on PostgreSQL 9.1 and above.


Thanks,
Karl


On Thu, Oct 30, 2014 at 7:30 PM, Aeham Abushwashi <
[email protected]> wrote:

> Hi,
>
> I have a multi-node zookeeper-based test environment with ~29M documents
> (~36M jobqueue records and ~29M ingeststatus records) and jobs in varying
> states of being (running vs complete vs stopped vs paused). Some jobs are
> continuous and a few are not.
>
> During a quiet time -when no documents are actively ingested- I see very
> high CPU utilisation on the Postgresql database server. This is primarily a
> result of two queries executed by the Stuffer Thread; namely the main query
> in JobManager#fetchAndProcessDocuments and the doc priority query in
> JobManager#getNextDocuments
>
> ==========================
>
> Query 1:
> SELECT t0.id
> ,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0  WHERE  (t0.status=? OR t0.status=?) AND t0.checkaction=?
> AND t0.checktime<=? AND EXISTS(SELECT 'x' FROM jobs t1 WHERE  (t1.status=?
> OR t1.status=?) AND t1.id=t0.jobid AND t1.priority=?) AND NOT
> EXISTS(SELECT
> 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN
> (?,?,?,?,?,?) AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM
> prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name)
> ORDER BY t0.docpriority ASC LIMIT 280
>
> Query 1 example params:
> [P, G, R, 1414683318802, A, a, 5, A, F, a, f, D, d]
>
> ++++++++
>
> Query 2:
> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0  WHERE  (status=?
> OR status=?) AND checkaction=? AND checktime<=? AND EXISTS(SELECT 'x' FROM
> jobs t1 WHERE  (t1.status=? OR t1.status=?) AND t1.id=t0.jobid)  ORDER BY
> docpriority ASC LIMIT 1
>
> Query 2 example params:
> [P, G, R, 1414685283327, A, a]
>
> ==========================
>
> There is one continuous job in a running state and which has ~50K documents
> in a pending-purgatory ('G') state but not due for processing until another
> couple of days. There are a few other jobs which are stopped and have a few
> million documents in a pending-purgatory state. I don't expect Manifold to
> find any documents eligible for processing in this case and it doesn't, but
> it takes time and (more seriously) a big chunk of Postgresql CPU to find
> the answer, 1 full CPU core for ~15 secs per MCF instance. This wouldn't be
> a big deal if the queries were executed infrequently. Unfortunately, they
> are run in a loop with a 2 second pause between iterations.
> I'd argue that it would be relatively easy to run into this issue with a
> bunch of large continuous jobs scheduled to run at varying times of the
> day.
>
> With my limited knowledge of the inner workings of the Stuffer Thread, it's
> not obvious to me why query 2 needs to run IF query 1 returns no hits at
> all. If it's not needed, then short-circuitting it in this case would help.
>
> Increasing the sleep time between iterations of the Stuffer Thread, in
> response to there being little or no work for a number of consecutive
> iterations, could be another easy win.
>
> More profoundly would be to optimise the joins between the jobs table and
> the jobqueue table. One way to do that would be to move some of the logic
> to the Java code to first identify active jobs and then query the jobqueue
> table just for those. I understand this would require merging and sorting
> priorities in the Java code which is less than ideal. I continue to analyse
> the execution plans of the two queries and experiment with slight variants
> to test performance. I'll report back any important findings.
>
> Any thoughts or advice are welcome. Also if have experience running large
> continuous crawls I'd be interested to hear from you particularly on your
> experience with performance and especially the performance of SQL queries.
>
> Best regards,
> Aeham
>

Re: (Continuous) crawl performance

Reply via email to