Unfortunately, I've found that this change triggers another bug. Please hold off until I have a fix for that one as well.
Karl

On Sun, Nov 2, 2014 at 9:34 AM, Karl Wright <[email protected]> wrote:

> Hi Aeham,
>
> The more I look at this, the more I'm convinced that there must be a bug
> elsewhere that permits jobqueue records with non-null document priorities
> to hang around.
>
> If you look at this JobManager method:
>
> public void finishJobStops(long timestamp, ArrayList modifiedJobs)
>   throws ManifoldCFException
>
> ... which is used whenever a job is stopped for any reason, you will note
> that the logic in question is *supposed* to clear the document priority
> field for all documents belonging to that job. If that works as it is
> supposed to, then there should be *no* documents belonging to inactive or
> paused jobs that still carry a real document priority in the docpriority
> index.
>
> So the only way that this index can be slow is if, perchance, PostgreSQL
> puts "null" values for a field FIRST in the index. But:
>
> >>>>>>
> /** Clear all document priorities for a job */
> public void clearDocPriorities(Long jobID)
>   throws ManifoldCFException
> {
>   // Clear the priority-set flag and push the priority to the sentinel.
>   HashMap map = new HashMap();
>   map.put(prioritySetField,null);
>   map.put(docPriorityField,nullDocPriority);
>   ArrayList list = new ArrayList();
>   // Restrict the update to the documents of the given job.
>   String query = buildConjunctionClause(list,new ClauseDescription[]{
>     new UnitaryClause(jobIDField,jobID)});
>   performUpdate(map,"WHERE "+query,list,null);
>   noteModifications(0,1,0);
> }
> <<<<<<
>
> ... where "nullDocPriority" is declared as:
>
> >>>>>>
> public static final double noDocPriorityValue = 1e9;
> public static final Double nullDocPriority = new Double(noDocPriorityValue + 1.0);
> <<<<<<
>
> ... which is a value picked so that it always sorts last. (It was
> necessary to pick a specific value because MySQL did not put nulls at the
> end, but rather at the beginning.)
>
> So now I am very curious as to what is *really* at the front of this
> index -- so please execute the query I sent and let me know.
>
> It occurs to me that there is another way for the query to be slow,
> however -- which is if there are insufficient matching rows to meet the
> LIMIT size. In that case, the entire index would be scanned. But this can
> be readily addressed by simply adding an appropriate WHERE condition:
>
> >>>>>>
> new UnitaryClause("t0."+jobQueue.docPriorityField,"<",nullDocPriority),
> <<<<<<
>
> If the results of the query I asked you to run look fine, then we may
> want to try this next. I've attached a second patch to CONNECTORS-1090.
> Please apply both patches to see if that fixes your query execution time.
>
>
> Thanks,
> Karl
>
>
>
> On Sat, Nov 1, 2014 at 8:09 AM, Karl Wright <[email protected]> wrote:
>
>> Hi Aeham,
>>
>> To figure out what is going on, it would probably be sufficient to
>> execute the following queries and send me the results:
>>
>> EXPLAIN SELECT
>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> FROM jobqueue t0 WHERE t0.checkaction='R' AND t0.checktime<=2414771559577
>> ORDER BY t0.docpriority LIMIT 20;
>>
>> SELECT
>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>> FROM jobqueue t0 WHERE t0.checkaction='R' AND t0.checktime<=2414771559577
>> ORDER BY t0.docpriority LIMIT 20;
>>
>> These should both read directly from the index in question; we can see
>> immediately what kinds of documents the scan has to skip to get to the
>> active ones.
>>
>> Thanks!
>> Karl
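>>
>> P.S. A rough aggregate view of the same information might be quicker to
>> eyeball. This is an untested sketch -- it assumes the nullDocPriority
>> sentinel of 1000000001.0 described above -- and it counts, by status and
>> checkaction, every row that still carries a real document priority:
>>
>> >>>>>>
>> SELECT t0.status, t0.checkaction, COUNT(*) AS docs
>> FROM jobqueue t0
>> WHERE t0.docpriority < 1000000001
>> GROUP BY t0.status, t0.checkaction
>> ORDER BY docs DESC;
>> <<<<<<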
>>
>> On Fri, Oct 31, 2014 at 2:42 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Aeham,
>>>
>>> I've looked at this at some level of detail now, and have some
>>> conclusions.
>>>
>>> (1) The first query you provided is in fact using the proper index for
>>> ordering; it's just not obvious from the plan. The way you can tell is
>>> by whether there is an explicit sort step in the plan; here there is
>>> none. The primary jobqueue index scan is happening within an anti-join
>>> and a semi-join; you can tell this by cutting out all the EXISTS
>>> clauses, getting a reduced plan that way, and comparing. For instance,
>>> here's an explain of the simple query:
>>>
>>> >>>>>>
>>> dbname=# explain SELECT
>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>> FROM jobqueue t0 ORDER BY t0.docpriority LIMIT 240;
>>>                                            QUERY PLAN
>>> -------------------------------------------------------------------------------------------------
>>>  Limit  (cost=0.29..38.20 rows=240 width=161)
>>>    ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2312.39 rows=14634 width=161)
>>> (2 rows)
>>> <<<<<<
>>>
>>> Applying filtering criteria:
>>>
>>> >>>>>>
>>> dbname=# explain SELECT
>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>> FROM jobqueue t0 WHERE (t0.status='P' OR t0.status='G') AND
>>> t0.checkaction='R' AND t0.checktime<=2414771559577 ORDER BY t0.docpriority
>>> LIMIT 240;
>>>                                            QUERY PLAN
>>> ------------------------------------------------------------------------------------------------
>>>  Limit  (cost=0.29..68.98 rows=240 width=161)
>>>    ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2389.31 rows=8347 width=161)
>>>          Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 2414771559577::bigint))
>>>          Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>>> (4 rows)
>>> <<<<<<
>>>
>>> Now, this is a bit strange, because the status ends up in a filter, so
>>> the index isn't helping there, and yet the index includes the status:
>>>
>>> "i1413214448001" btree (docpriority, status, checkaction, checktime)
>>>
>>> But maybe this is a statistical optimization? I don't know.
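>>>
>>> (One cheap experiment here -- untested, and I don't know whether the
>>> planner will actually take it -- would be to rewrite the OR as an IN
>>> list, which PostgreSQL normalizes to "= ANY(...)" and might then accept
>>> as an index condition rather than a filter:
>>>
>>> >>>>>>
>>> EXPLAIN SELECT
>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>> FROM jobqueue t0 WHERE t0.status IN ('P','G') AND t0.checkaction='R'
>>> AND t0.checktime<=2414771559577 ORDER BY t0.docpriority LIMIT 240;
>>> <<<<<<
>>>
>>> If status then shows up under "Index Cond", the index is doing that
>>> work; if not, nothing is lost.)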
>>>
>>> Anyhow, as soon as a NOT EXISTS clause is added, you see an anti-join:
>>>
>>> >>>>>>
>>> dbname=# explain SELECT
>>> t0.id,t0.jobid,t0.dochash,t0.docid,t0.status,t0.failtime,t0.failcount,t0.priorityset
>>> FROM jobqueue t0 WHERE (t0.status='P' OR t0.status='G') AND
>>> t0.checkaction='R' AND t0.checktime<=2414771559577 AND NOT EXISTS(SELECT
>>> 'x' FROM prereqevents t3,events t4 WHERE t0.id=t3.owner AND
>>> t3.eventname=t4.name) ORDER BY t0.docpriority LIMIT 240;
>>>                                                 QUERY PLAN
>>> ----------------------------------------------------------------------------------------------------------
>>>  Limit  (cost=0.57..231.40 rows=240 width=161)
>>>    ->  Nested Loop Anti Join  (cost=0.57..7957.24 rows=8273 width=161)
>>>          ->  Index Scan using i1413214448001 on jobqueue t0  (cost=0.29..2389.31 rows=8347 width=161)
>>>                Index Cond: ((checkaction = 'R'::bpchar) AND (checktime <= 2414771559577::bigint))
>>>                Filter: ((status = 'P'::bpchar) OR (status = 'G'::bpchar))
>>>          ->  Nested Loop  (cost=0.29..0.66 rows=1 width=8)
>>>                ->  Index Scan using i1413214448007 on prereqevents t3  (cost=0.14..0.17 rows=1 width=524)
>>>                      Index Cond: (t0.id = owner)
>>>                ->  Index Only Scan using events_pkey on events t4  (cost=0.14..0.48 rows=1 width=516)
>>>                      Index Cond: (name = (t3.eventname)::text)
>>> (10 rows)
>>> <<<<<<
>>>
>>> The index scan is still there and is apparently providing the ordering,
>>> which is still good; it's just inside an anti-join.
>>>
>>> So you can take this further and add the two remaining EXISTS/NOT EXISTS
>>> clauses, and then your first plan makes sense. But your proposed *second*
>>> plan is clearly *not* using the docpriority index for sorting, since it
>>> sorts by itself later on:
>>>
>>> >>>>>>
>>> Limit  (cost=122440.72..122441.42 rows=280 width=177) (actual time=1010.722..1010.722 rows=0 loops=1)
>>>   ->  Sort  (cost=122440.72..122451.13 rows=4164 width=177) (actual time=1010.719..1010.719 rows=0 loops=1)
>>>         Sort Key: t0.docpriority
>>>         Sort Method: quicksort  Memory: 25kB
>>> <<<<<<
>>>
>>> So what we need to understand now is why the first query is slow. It's
>>> reading from the correct index, but is nevertheless taking a long time,
>>> possibly because there are a lot of records with status 'P' or 'G' that
>>> belong to jobs that are not in state 'a' or 'A', and the scan has to go
>>> through these before it assembles 240 results. Can you confirm this
>>> picture? (A rough count query for checking this is sketched at the very
>>> bottom of this thread.)
>>>
>>> Thanks,
>>> Karl
>>>
>>>
>>> On Fri, Oct 31, 2014 at 1:09 PM, Aeham Abushwashi <
>>> [email protected]> wrote:
>>>
>>>> That's great news, thanks Karl! For my part, I've been digging deeper
>>>> into the execution plans of the two queries, and I've made a similar
>>>> observation about the ORDER BY skewing the original query's plan
>>>> significantly. However, I have yet to figure out why the same ORDER BY
>>>> clause doesn't have a similar impact on the modified query that uses a
>>>> JOIN. I'll look into this a bit more over the weekend.
>>>>
>>>> Best regards,
>>>> Aeham
>>>>
>>>
>>>
>>
>
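P.S. Regarding confirming the picture above: a rough count of how the
matching documents distribute across job states would settle it. This is an
untested sketch; it assumes the jobs table is named "jobs", is keyed by
"id", has its own "status" column, and that jobqueue.jobid references it:

>>>>>>
SELECT t1.status AS job_status, COUNT(*) AS docs
FROM jobqueue t0
  JOIN jobs t1 ON t1.id = t0.jobid
WHERE (t0.status='P' OR t0.status='G')
  AND t0.checkaction='R' AND t0.checktime<=2414771559577
GROUP BY t1.status
ORDER BY docs DESC;
<<<<<<

If most of the count falls under job states other than 'a' or 'A', that
would confirm that the index scan is wading through documents of inactive
jobs before it can assemble 240 results.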
