Hi Aeham,

To figure out what is going on, it would probably be sufficient to execute the following queries and send me the results:
EXPLAIN SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 WHERE t0.checkaction='R' AND t0.checktime<=2414771559577 ORDER BY t0.docpriority LIMIT 20;

SELECT t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset FROM jobqueue t0 WHERE t0.checkaction='R' AND t0.checktime<=2414771559577 ORDER BY t0.docpriority LIMIT 20;

These should both read directly from the index in question; from them we can see immediately what kinds of documents the scan has to skip over before it reaches the active ones.

Thanks!
Karl

On Fri, Oct 31, 2014 at 2:42 PM, Karl Wright <[email protected]> wrote:

> Hi Aeham,
>
> I've looked at this at some level of detail now, and have some
> conclusions.
>
> (1) The first query you provided is in fact using the proper index for
> ordering; it's just not obvious from the plan. The way you can tell is
> whether there is an explicit sort step above the limit: if the plan has no
> separate Sort node, the index itself is supplying the order. The primary
> jobqueue index scan is happening within an anti-join and a semi-join; you
> can verify this by cutting out all the EXISTS clauses, getting a reduced
> plan that way, and comparing.
> For instance, here's an explain of the simple query:
>
> >>>>>>
> dbname=# explain SELECT
> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0 ORDER BY t0.docpriority LIMIT 240;
>                                            QUERY PLAN
> -------------------------------------------------------------------------------------------------
>  Limit  (cost=0.29..38.20 rows=240 width=161)
>    ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2312.39 rows=14634 width=161)
> (2 rows)
> <<<<<<
>
> Applying the filtering criteria:
>
> >>>>>>
> dbname=# explain SELECT
> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0 WHERE (t0.status='P' OR t0.status='G') AND
> t0.checkaction='R' AND t0.checktime<=2414771559577 ORDER BY t0.docpriority
> LIMIT 240;
>                                            QUERY PLAN
> ------------------------------------------------------------------------------------------------
>  Limit  (cost=0.29..68.98 rows=240 width=161)
>    ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2389.31 rows=8347 width=161)
>          Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 2414771559577::bigint))
>          Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
> (4 rows)
> <<<<<<
>
> Now, this is a bit strange, because there is a filter on the status, so
> the index isn't helping there, and yet the index includes the status
> column:
>
> "i1413214448001" btree (docpriority, status, checkaction, checktime)
>
> But maybe this is a statistical optimization?  I don't know.
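A side note on the Filter-versus-Index Cond question above: one variant that might be worth trying, purely as an assumption rather than something verified against this database, is writing the OR of status equalities as an IN list. PostgreSQL can plan an IN list as a `status = ANY ('{...}')` condition inside a single index scan, whereas an OR of separate equalities often stays a Filter:

```sql
-- Hypothetical variant (not from the thread): the IN list may become an
-- "= ANY(...)" index condition on the status column of i1413214448001.
EXPLAIN SELECT t0.id, t0.jobid, t0.dochash, t0.docid,
       t0.status, t0.failtime, t0.failcount, t0.priorityset
FROM jobqueue t0
WHERE t0.status IN ('P', 'G')
  AND t0.checkaction = 'R'
  AND t0.checktime <= 2414771559577
ORDER BY t0.docpriority
LIMIT 240;
```

Whether the planner actually pushes the condition into the index scan here would have to be confirmed from the resulting plan.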
> Anyhow, as soon as a NOT EXISTS clause is added, you see an anti-join:
>
> >>>>>>
> dbname=# explain SELECT
> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
> FROM jobqueue t0 WHERE (t0.status='P' OR t0.status='G') AND
> t0.checkaction='R' AND t0.checktime<=2414771559577 AND NOT EXISTS(SELECT
> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
> t3.eventname=t4.name) ORDER BY t0.docpriority LIMIT 240;
>                                            QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.57..231.40 rows=240 width=161)
>    ->  Nested Loop Anti Join  (cost=0.57..7957.24 rows=8273 width=161)
>          ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2389.31 rows=8347 width=161)
>                Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 2414771559577::bigint))
>                Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>          ->  Nested Loop  (cost=0.29..0.66 rows=1 width=8)
>                ->  Index Scan using i1413214448007 on prereqevents t3  (cost=0.14..0.17 rows=1 width=524)
>                      Index Cond: (t0.id = owner)
>                ->  Index Only Scan using events_pkey on events t4  (cost=0.14..0.48 rows=1 width=516)
>                      Index Cond: (name = (t3.eventname)::text)
> (10 rows)
> <<<<<<
>
> The index scan is still there and is apparently providing the sorting,
> which is still good; it's just inside an anti-join.
>
> So you can take this further and add the two remaining EXISTS/NOT EXISTS
> clauses, and then your first plan makes sense.
> But your proposed *second* plan is clearly *not* using the docpriority
> index for sorting, since it sorts explicitly later on:
>
> >>>>>>
>  Limit  (cost=122440.72..122441.42 rows=280 width=177) (actual time=1010.722..1010.722 rows=0 loops=1)
>    ->  Sort  (cost=122440.72..122451.13 rows=4164 width=177) (actual time=1010.719..1010.719 rows=0 loops=1)
>          Sort Key: t0.docpriority
>          Sort Method: quicksort  Memory: 25kB
> <<<<<<
>
> So what we need to understand now is why the first query is slow. It's
> reading from the correct index but is nevertheless taking a long time,
> possibly because there are a lot of records with status 'P' or 'G' that
> belong to jobs not in state 'a' or 'A', all of which it has to go through
> before it assembles 240 results. Can you confirm this picture?
>
> Thanks,
> Karl
>
>
> On Fri, Oct 31, 2014 at 1:09 PM, Aeham Abushwashi <
> [email protected]> wrote:
>
>> That's great news, thanks Karl! On my part, I've been digging deeper
>> into the execution plans of the two queries, and I've made a similar
>> observation of the ORDER BY skewing the original query's plan
>> significantly. However, I have yet to figure out why the same ORDER BY
>> clause doesn't have a similar impact on the modified query that uses a
>> JOIN. I'll look into this a bit more over the weekend.
>>
>> Best regards,
>> Aeham
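One way to check the picture Karl describes (many 'P'/'G' rows belonging to inactive jobs ahead of the active ones) is a grouping count. This is a hypothetical diagnostic sketch: the `jobs` table name, its `id` and `status` columns, and the join condition are assumptions, not taken from the thread.

```sql
-- Hypothetical diagnostic (assumed schema): for the rows the slow query
-- walks, count how many belong to jobs in each job state.  A large count
-- for states other than 'a'/'A' would confirm the hypothesis.
SELECT t1.status AS jobstatus, COUNT(*)
FROM jobqueue t0
JOIN jobs t1 ON t0.jobid = t1.id
WHERE (t0.status = 'P' OR t0.status = 'G')
  AND t0.checkaction = 'R'
  AND t0.checktime <= 2414771559577
GROUP BY t1.status;
```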
