Hi Aeham,

I opened CONNECTORS-1090 to track any changes that get made for this problem.
I've attached a patch which should change query generation for the stuffer
query to remove the remaining IN clause. I don't generally expect that to
change the plan, but given that we don't fully understand the planner's
decision process, it may have an effect.

Karl

On Thu, Oct 30, 2014 at 9:04 PM, Karl Wright <[email protected]> wrote:

> Hi Aeham,
>
> The stuffer query is designed to make use of the following index, and
> indeed to read rows in sorted order from that index:
>
> IndexDescription docpriorityIndex = new IndexDescription(false,new
> String[]{docPriorityField,statusField,checkActionField,checkTimeField});
>
> All that the joins do is effectively filter the results based on job
> status, and on whether a given document is actually present in, and being
> worked on by, another job. The second case is rare, so any plan that
> emphasizes the jobqueue interaction is plainly wrong. I could see the
> first join being meaningful in some cases, but only if you had lots of
> inactive jobs which each had lots of documents. Please let me know if
> that is the situation here.
>
> It is a mystery why PostgreSQL does not always use this index. In my
> experience, the main cause is PostgreSQL's cost optimizer making huge
> distinctions between there being zero of something vs. one of something.
> You can tell readily enough whether this is what is happening by getting
> a plan for the stuffer query; if it is NOT reading from the index I
> mentioned, run ANALYZE jobqueue and then get the plan again. If the
> second plan is fine, then this previously noted PostgreSQL problem is to
> blame.
>
> I'm afraid that your notion of breaking up the query and doing it in
> Java is extremely problematic, because then we would be forced to read
> huge result sets into memory. If you can demonstrate that an ANALYZE
> needs to be done more frequently, on the other hand, there is a
> configuration parameter for that. Please let me know.
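[Editor's note: Karl's diagnostic above can be run directly against the ManifoldCF database. A minimal sketch, assuming psql access; the literal values stand in for the prepared-statement parameters, the query is a simplified form of the real stuffer query, and the autovacuum tuning at the end is a generic PostgreSQL knob, not something Karl names here:]

```sql
-- Get a plan for (a simplified form of) the stuffer query. If the plan does
-- not read the jobqueue docpriority index in sorted order, refresh statistics
-- and plan again; if the second plan is fine, stale statistics were to blame.
EXPLAIN
SELECT t0.id, t0.docpriority, t0.jobid, t0.dochash, t0.docid
FROM jobqueue t0
WHERE (t0.status='P' OR t0.status='G')
  AND t0.checkaction='R'
  AND t0.checktime <= 1414683318802
ORDER BY t0.docpriority ASC
LIMIT 280;

ANALYZE jobqueue;

-- ...then run the EXPLAIN again and compare the two plans.

-- PostgreSQL can also be told to analyze this table more aggressively on its
-- own (a per-table storage parameter, independent of any ManifoldCF setting):
ALTER TABLE jobqueue SET (autovacuum_analyze_scale_factor = 0.02);
```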
>
> Karl
>
> On Thu, Oct 30, 2014 at 8:28 PM, Aeham Abushwashi <
> [email protected]> wrote:
>
>> Hi Karl,
>>
>> I do have that change. Paul Boichat, who was involved in the conversation
>> leading up to CONNECTORS-1027, applied the fix and tested it in our
>> environment.
>>
>> I see your point that query #1 still has the IN (...) clause, indicating
>> that it hasn't benefited from the CONNECTORS-1027 fix. However, in my
>> execution plan that part of the query is never executed anyway - see
>> below for details. In addition, query #2 is clean, yet it hits the
>> database hard.
>>
>> ======================
>>
>> Execution Plan for Query #1:
>>
>> EXPLAIN ANALYZE SELECT t0.id,
>> t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> FROM jobqueue t0 WHERE (t0.status='P' OR t0.status='G') AND
>> t0.checkaction='R' AND t0.checktime<=1414683318802 AND EXISTS(SELECT 'x'
>> FROM jobs t1 WHERE (t1.status='A' OR t1.status='a') AND t1.id=t0.jobid
>> AND t1.priority=5) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
>> t2.dochash=t0.dochash AND t2.status IN ('A','F','a','f','D','d') AND
>> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents t3,events
>> t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name) ORDER BY t0.docpriority
>> ASC LIMIT 280;
>>
>> QUERY PLAN
>> ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>> Limit  (cost=0.00..85910.03 rows=280 width=178) (actual time=15152.258..15152.258 rows=0 loops=1)
>>   ->  Nested Loop Anti Join  (cost=0.00..3187261.91 rows=10388 width=178) (actual time=15152.255..15152.255 rows=0 loops=1)
>>         Join Filter: ((t2.jobid <> t0.jobid) AND ((t2.dochash)::text = (t0.dochash)::text))
>>         ->  Nested Loop Anti Join  (cost=0.00..3186114.69 rows=10388 width=178) (actual time=15152.253..15152.253 rows=0 loops=1)
>>               ->  Nested Loop Semi Join  (cost=0.00..3177053.22 rows=10388 width=178) (actual time=15152.252..15152.252 rows=0 loops=1)
>>                     ->  Index Scan using i1392985450172 on jobqueue t0  (cost=0.00..3141756.78 rows=124662 width=178) (actual time=686.159..4046.926 rows=3346768 loops=1)
>>                           Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 1414683318802::bigint))
>>                           Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>>                     ->  Index Scan using jobs_pkey on jobs t1  (cost=0.00..0.28 rows=1 width=8) (actual time=0.003..0.003 rows=0 loops=3346768)
>>                           Index Cond: (id = t0.jobid)
>>                           Filter: ((priority = 5) AND ((status = 'A'::bpchar) OR (status = 'a'::bpchar)))
>>                           Rows Removed by Filter: 1
>>               ->  Nested Loop  (cost=0.00..0.86 rows=1 width=8) (never executed)
>>                     ->  Index Scan using i1392985450178 on prereqevents t3  (cost=0.00..0.27 rows=1 width=524) (never executed)
>>                           Index Cond: (t0.id = owner)
>>                     ->  Index Only Scan using events_pkey on events t4  (cost=0.00..0.58 rows=1 width=516) (never executed)
>>                           Index Cond: (name = (t3.eventname)::text)
>>                           Heap Fetches: 0
>>         ->  Materialize  (cost=0.00..56.48 rows=6 width=49) (never executed)
>>               ->  Index Scan using i1392985450173 on jobqueue t2  (cost=0.00..56.45 rows=6 width=49) (never executed)
>>                     Index Cond: (status = ANY ('{A,F,a,f,D,d}'::bpchar[]))
>> Total runtime: 15152.487 ms
>>
>>
>> Execution Plan for Query #2:
>>
>> EXPLAIN ANALYZE SELECT docpriority,jobid,dochash,docid FROM jobqueue t0
>> WHERE (status='P' OR status='G') AND checkaction='R' AND
>> checktime<=1414685283327 AND EXISTS(SELECT 'x' FROM jobs t1 WHERE
>> (t1.status='A' OR t1.status='a') AND t1.id=t0.jobid) ORDER BY
>> docpriority ASC LIMIT 1;
>>
>> QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> Limit  (cost=0.00..304.78 rows=1 width=144) (actual time=20620.458..20620.458 rows=0 loops=1)
>>   ->  Nested Loop Semi Join  (cost=0.00..3029548.13 rows=9940 width=144) (actual time=20620.456..20620.456 rows=0 loops=1)
>>         ->  Index Scan using i1392985450172 on jobqueue t0  (cost=0.00..2996093.69 rows=119282 width=144) (actual time=44.118..5665.201 rows=3346768 loops=1)
>>               Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 1414685283327::bigint))
>>               Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>>         ->  Index Scan using jobs_pkey on jobs t1  (cost=0.00..0.28 rows=1 width=8) (actual time=0.004..0.004 rows=0 loops=3346768)
>>               Index Cond: (id = t0.jobid)
>>               Filter: ((status = 'A'::bpchar) OR (status = 'a'::bpchar))
>>               Rows Removed by Filter: 1
>> Total runtime: 20620.642 ms
>>
>> ======================
>>
>>
>> Best regards,
>> Aeham
>>
>> On 30 October 2014 23:45, Karl Wright <[email protected]> wrote:
>>
>> > The ticket, in case you are interested, is CONNECTORS-1027.
>> >
>> > Karl
>> >
>> >
>> > On Thu, Oct 30, 2014 at 7:42 PM, Karl Wright <[email protected]> wrote:
>> >
>> > > Hi Aeham,
>> > >
>> > > The problem is that you are not using MCF 1.7.1, which has a fix for
>> > > the stuffer thread query that makes it run quickly on PostgreSQL 9.1
>> > > and above.
>> > >
>> > > Thanks,
>> > > Karl
>> > >
>> > >
>> > > On Thu, Oct 30, 2014 at 7:30 PM, Aeham Abushwashi <
>> > > [email protected]> wrote:
>> > >
>> > >> Hi,
>> > >>
>> > >> I have a multi-node zookeeper-based test environment with ~29M
>> > >> documents (~36M jobqueue records and ~29M ingeststatus records) and
>> > >> jobs in varying states of being (running vs complete vs stopped vs
>> > >> paused). Some jobs are continuous and a few are not.
>> > >>
>> > >> During a quiet time - when no documents are actively being ingested -
>> > >> I see very high CPU utilisation on the PostgreSQL database server.
>> > >> This is primarily a result of two queries executed by the stuffer
>> > >> thread, namely the main query in JobManager#fetchAndProcessDocuments
>> > >> and the doc priority query in JobManager#getNextDocuments.
>> > >>
>> > >> ==========================
>> > >>
>> > >> Query 1:
>> > >> SELECT t0.id,
>> > >> t0.docpriority,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> > >> FROM jobqueue t0 WHERE (t0.status=? OR t0.status=?) AND
>> > >> t0.checkaction=? AND t0.checktime<=? AND EXISTS(SELECT 'x' FROM jobs
>> > >> t1 WHERE (t1.status=? OR t1.status=?) AND t1.id=t0.jobid AND
>> > >> t1.priority=?) AND NOT EXISTS(SELECT 'x' FROM jobqueue t2 WHERE
>> > >> t2.dochash=t0.dochash AND t2.status IN (?,?,?,?,?,?) AND
>> > >> t2.jobid!=t0.jobid) AND NOT EXISTS(SELECT 'x' FROM prereqevents
>> > >> t3,events t4 WHERE t0.id=t3.owner AND t3.eventname=t4.name)
>> > >> ORDER BY t0.docpriority ASC LIMIT 280
>> > >>
>> > >> Query 1 example params:
>> > >> [P, G, R, 1414683318802, A, a, 5, A, F, a, f, D, d]
>> > >>
>> > >> ++++++++
>> > >>
>> > >> Query 2:
>> > >> SELECT docpriority,jobid,dochash,docid FROM jobqueue t0 WHERE
>> > >> (status=? OR status=?) AND checkaction=? AND checktime<=? AND
>> > >> EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status=? OR t1.status=?)
>> > >> AND t1.id=t0.jobid) ORDER BY docpriority ASC LIMIT 1
>> > >>
>> > >> Query 2 example params:
>> > >> [P, G, R, 1414685283327, A, a]
>> > >>
>> > >> ==========================
>> > >>
>> > >> There is one continuous job in a running state, which has ~50K
>> > >> documents in a pending-purgatory ('G') state but not due for
>> > >> processing for another couple of days. There are a few other jobs
>> > >> which are stopped and have a few million documents in a
>> > >> pending-purgatory state.
>> > >> I don't expect Manifold to find any documents eligible for processing
>> > >> in this case, and it doesn't, but it takes time and (more seriously) a
>> > >> big chunk of PostgreSQL CPU to find the answer: one full CPU core for
>> > >> ~15 secs per MCF instance. This wouldn't be a big deal if the queries
>> > >> were executed infrequently. Unfortunately, they are run in a loop with
>> > >> a 2-second pause between iterations. I'd argue that it would be
>> > >> relatively easy to run into this issue with a bunch of large
>> > >> continuous jobs scheduled to run at varying times of the day.
>> > >>
>> > >> With my limited knowledge of the inner workings of the stuffer
>> > >> thread, it's not obvious to me why query 2 needs to run IF query 1
>> > >> returns no hits at all. If it's not needed, then short-circuiting it
>> > >> in this case would help.
>> > >>
>> > >> Increasing the sleep time between iterations of the stuffer thread,
>> > >> in response to there being little or no work for a number of
>> > >> consecutive iterations, could be another easy win.
>> > >>
>> > >> A more profound change would be to optimise the joins between the
>> > >> jobs table and the jobqueue table. One way to do that would be to
>> > >> move some of the logic to the Java code: first identify active jobs,
>> > >> then query the jobqueue table just for those. I understand this would
>> > >> require merging and sorting priorities in the Java code, which is
>> > >> less than ideal. I continue to analyse the execution plans of the two
>> > >> queries and to experiment with slight variants to test performance.
>> > >> I'll report back any important findings.
>> > >>
>> > >> Any thoughts or advice are welcome.
>> > >> Also, if you have experience running large continuous crawls, I'd be
>> > >> interested to hear from you, particularly about performance and
>> > >> especially the performance of SQL queries.
>> > >>
>> > >> Best regards,
>> > >> Aeham
>> > >>
>> > >
>> > >
>> >
>> >
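[Editor's note: Aeham's "easy win" of lengthening the sleep after consecutive empty iterations could be sketched roughly as below. This is an illustrative sketch only, not ManifoldCF's actual stuffer thread; the class name, constants, and field are invented for illustration. The 2000 ms base matches the 2-second pause described in the thread.]

```java
// Sketch of an adaptive back-off for a stuffer-style polling loop.
// When a pass stuffs no documents, the pause doubles per consecutive
// empty pass up to a cap; any successful pass resets it to the base.
public class BackoffSleep
{
  private static final long BASE_SLEEP_MS = 2000L;   // the current fixed pause
  private static final long MAX_SLEEP_MS = 60000L;   // cap on the back-off

  private int emptyIterations = 0;

  /** Record the outcome of one stuffing pass and return the next pause in ms. */
  public long nextSleep(int documentsStuffed)
  {
    if (documentsStuffed > 0)
    {
      // Work was found: return to the normal short polling interval.
      emptyIterations = 0;
      return BASE_SLEEP_MS;
    }
    // No work found: exponential back-off, bounded by MAX_SLEEP_MS.
    emptyIterations++;
    long sleep = BASE_SLEEP_MS << Math.min(emptyIterations, 5);
    return Math.min(sleep, MAX_SLEEP_MS);
  }
}
```

Under this sketch, an idle deployment would settle at one expensive query pass per minute instead of one every two seconds, while a busy one keeps the current responsiveness.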
