Hi Aeham, The problem is that you are not using MCF 1.7.1, which has a fix for the stuffer thread query that makes it run quickly on PostgreSQL 9.1 and above.
Thanks, Karl On Thu, Oct 30, 2014 at 7:30 PM, Aeham Abushwashi < [email protected]> wrote: > Hi, > > I have a multi-node zookeeper-based test environment with ~29M documents > (~36M jobqueue records and ~29M ingeststatus records) and jobs in varying > states of being (running vs complete vs stopped vs paused). Some jobs are > continuous and a few are not. > > During a quiet time -when no documents are actively ingested- I see very > high CPU utilisation on the Postgresql database server. This is primarily a > result of two queries executed by the Stuffer Thread; namely the main query > in JobManager#fetchAndProcessDocuments and the doc priority query in > JobManager#getNextDocuments > > ========================== > > Query 1: > SELECT t0.id > ,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset > FROM jobqueue t0 WHERE (t0.status=? OR t0.status=?) AND t0.checkaction=? > AND t0.checktime<=? AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status=? > OR t1.status=?) AND t1.id=t0.jobid AND t1.priority=?) AND NOT > EXISTS(SELECT > 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN > (?,?,?,?,?,?) AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM > prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) > ORDER BY t0.docpriority ASC LIMIT 280 > > Query 1 example params: > [P, G, R, 1414683318802, A, a, 5, A, F, a, f, D, d] > > ++++++++ > > Query 2: > SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE (status=? > OR status=?) AND checkaction=? AND checktime<=? AND EXISTS(SELECT 'x' FROM > jobs t1 WHERE (t1.status=? OR t1.status=?) AND t1.id=t0.jobid) ORDER BY > docpriority ASC LIMIT 1 > > Query 2 example params: > [P, G, R, 1414685283327, A, a] > > ========================== > > There is one continuous job in a running state and which has ~50K documents > in a pending-purgatory ('G') state but not due for processing until another > couple of days. There are a few other jobs which are stopped and have a few > million documents in a pending-purgatory state. I don't expect Manifold to > find any documents eligible for processing in this case and it doesn't, but > it takes time and (more seriously) a big chunk of Postgresql CPU to find > the answer, 1 full CPU core for ~15 secs per MCF instance. This wouldn't be > a big deal if the queries were executed infrequently. Unfortunately, they > are run in a loop with a 2 second pause between iterations. > I'd argue that it would be relatively easy to run into this issue with a > bunch of large continuous jobs scheduled to run at varying times of the > day. > > With my limited knowledge of the inner workings of the Stuffer Thread, it's > not obvious to me why query 2 needs to run IF query 1 returns no hits at > all. If it's not needed, then short-circuitting it in this case would help. > > Increasing the sleep time between iterations of the Stuffer Thread, in > response to there being little or no work for a number of consecutive > iterations, could be another easy win. > > More profoundly would be to optimise the joins between the jobs table and > the jobqueue table. One way to do that would be to move some of the logic > to the Java code to first identify active jobs and then query the jobqueue > table just for those. I understand this would require merging and sorting > priorities in the Java code which is less than ideal. I continue to analyse > the execution plans of the two queries and experiment with slight variants > to test performance. I'll report back any important findings. > > Any thoughts or advice are welcome. Also if have experience running large > continuous crawls I'd be interested to hear from you particularly on your > experience with performance and especially the performance of SQL queries. > > Best regards, > Aeham >
