The ticket, in case you are interested, is CONNECTORS-1027. Karl
On Thu, Oct 30, 2014 at 7:42 PM, Karl Wright <[email protected]> wrote: > Hi Aeham, > > The problem is that you are not using MCF 1.7.1, which has a fix for the > stuffer thread query that makes it run quickly on PostgreSQL 9.1 and above. > > Thanks, > Karl > > > On Thu, Oct 30, 2014 at 7:30 PM, Aeham Abushwashi < > [email protected]> wrote: > >> Hi, >> >> I have a multi-node zookeeper-based test environment with ~29M documents >> (~36M jobqueue records and ~29M ingeststatus records) and jobs in varying >> states of being (running vs complete vs stopped vs paused). Some jobs are >> continuous and a few are not. >> >> During a quiet time -when no documents are actively ingested- I see very >> high CPU utilisation on the Postgresql database server. This is primarily >> a >> result of two queries executed by the Stuffer Thread; namely the main >> query >> in JobManager#fetchAndProcessDocuments and the doc priority query in >> JobManager#getNextDocuments >> >> ========================== >> >> Query 1: >> SELECT t0.id >> ,t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset >> FROM jobqueue t0 WHERE (t0.status=? OR t0.status=?) AND t0.checkaction=? >> AND t0.checktime<=? AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status=? >> OR t1.status=?) AND t1.id=t0.jobid AND t1.priority=?) AND NOT >> EXISTS(SELECT >> 'x' FROM jobqueue t2 WHERE t2.dochash=t0.dochash AND t2.status IN >> (?,?,?,?,?,?) AND t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM >> prereqevents t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) >> ORDER BY t0.docpriority ASC LIMIT 280 >> >> Query 1 example params: >> [P, G, R, 1414683318802, A, a, 5, A, F, a, f, D, d] >> >> ++++++++ >> >> Query 2: >> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE (status=? >> OR status=?) AND checkaction=? AND checktime<=? AND EXISTS(SELECT 'x' FROM >> jobs t1 WHERE (t1.status=? OR t1.status=?) AND t1.id=t0.jobid) ORDER BY >> docpriority ASC LIMIT 1 >> >> Query 2 example params: >> [P, G, R, 1414685283327, A, a] >> >> ========================== >> >> There is one continuous job in a running state and which has ~50K >> documents >> in a pending-purgatory ('G') state but not due for processing until >> another >> couple of days. There are a few other jobs which are stopped and have a >> few >> million documents in a pending-purgatory state. I don't expect Manifold to >> find any documents eligible for processing in this case and it doesn't, >> but >> it takes time and (more seriously) a big chunk of Postgresql CPU to find >> the answer, 1 full CPU core for ~15 secs per MCF instance. This wouldn't >> be >> a big deal if the queries were executed infrequently. Unfortunately, they >> are run in a loop with a 2 second pause between iterations. >> I'd argue that it would be relatively easy to run into this issue with a >> bunch of large continuous jobs scheduled to run at varying times of the >> day. >> >> With my limited knowledge of the inner workings of the Stuffer Thread, >> it's >> not obvious to me why query 2 needs to run IF query 1 returns no hits at >> all. If it's not needed, then short-circuitting it in this case would >> help. >> >> Increasing the sleep time between iterations of the Stuffer Thread, in >> response to there being little or no work for a number of consecutive >> iterations, could be another easy win. >> >> More profoundly would be to optimise the joins between the jobs table and >> the jobqueue table. One way to do that would be to move some of the logic >> to the Java code to first identify active jobs and then query the jobqueue >> table just for those. I understand this would require merging and sorting >> priorities in the Java code which is less than ideal. I continue to >> analyse >> the execution plans of the two queries and experiment with slight variants >> to test performance. I'll report back any important findings. >> >> Any thoughts or advice are welcome. Also if have experience running large >> continuous crawls I'd be interested to hear from you particularly on your >> experience with performance and especially the performance of SQL queries. >> >> Best regards, >> Aeham >> > >
