Re: (Continuous) crawl performance

Karl Wright Fri, 07 Nov 2014 09:37:55 -0800

Hi Aeham,

The query plans look reasonable.  The fact that all the document priorities
need to be written accounts for the high postgresql I/O.  About the only
thing I can think to do is put the writes in a transaction, which may help
performance somewhat.  I'll experiment with that over the weekend.


bq. I was also wondering, and this is just me speculating at this point,
whether the exact same set of docs would be re-processed every time another
crawler agent node is started.

That should not happen.  There are two different kinds of restarts -- one
where NO agents nodes are registered, and a second where there are already
agents nodes registered.  Reprioritization should not take place when the
secondary nodes start up.

However, reprioritization *does* take place whenever you start, stop,
pause, or resume a job.  I don't have a good solution for that issue yet.

Karl


On Fri, Nov 7, 2014 at 11:50 AM, Aeham Abushwashi <
[email protected]> wrote:

> With all crawler agents stopped, the current state of the system is as
> follows (using a variant of the reprioritization query):
>
> SELECT jobid, status, COUNT(*) FROM jobqueue t0 WHERE  (status='H' OR
> status='P' OR status='G') AND priorityset<1415374856000 AND checkaction='R'
> AND EXISTS(SELECT 'x' FROM jobs t1 WHERE  (t1.status='B' OR t1.status='b'
> OR t1.status='A' OR t1.status='a' OR t1.status='R' OR t1.status='r' OR
> t1.status='O' OR t1.status='o' OR t1.status='U' OR t1.status='u') AND
> t1.id=t0.jobid)
> GROUP BY jobid, status ORDER BY jobid;
>      jobid     | status |  count
> ---------------+--------+---------
>  1413897188357 | P      |   47281
>  1413897188357 | G      | 2420206
>  1414075755977 | G      |   43399
>  1415018782273 | G      |       9
>  1415020466369 | G      |      24
>
> Note that all 4 crawls are in continuous mode and status Active.
>
> When the first agent node is started, it iterates through the above docs
> and updates their priority. Starting with the following query:
>
> SELECT id,dochash,docid,jobid FROM jobqueue t0 WHERE  (status='H' OR
> status='P' OR status='G') AND priorityset< [some timestamp] AND
> checkaction='R' AND EXISTS(SELECT 'x' FROM jobs t1 WHERE  (t1.status='B' OR
> t1.status='b' OR t1.status='A' OR t1.status='a' OR t1.status='R' OR
> t1.status='r' OR t1.status='O' OR t1.status='o' OR t1.status='U' OR
> t1.status='u') AND t1.id=t0.jobid) LIMIT 10000
>
> which takes ~4 seconds for each batch. The AgentsThread then proceeds to
> update their priority one by one. This, the VisualVM profiler suggests,
> adds another ~9 seconds to each iteration and hits postgresql (IO) really
> hard. With ~2.5M docs to go through, the numbers gradually add up.
>
> I was also wondering, and this is just me speculating at this point,
> whether the exact same set of docs would be re-processed every time another
> crawler agent node is started. I say that because, the AgentsThread sets
> priorityset to current time which presumably would lead any subsequent
> starting node to hit the (moe or less) same set of documents as the
> previous one did.
>
>
> For reference, the execution plan of the SELECT QUERY is as follows:
>
>  EXPLAIN ANALYZE SELECT id,dochash,docid,jobid FROM jobqueue t0 WHERE
> (status='H' OR status='P' OR status='G') AND priorityset<1415374856000 AND
> checkaction='R' AND EXISTS(SELECT 'x' FROM jobs t1 WHERE  (t1.status='B' OR
> t1.status='b' OR t1.status='A' OR t1.status='a' OR t1.status='R' OR
> t1.status='r' OR t1.status='O' OR t1.status='o' OR t1.status='U' OR
> t1.status='u') AND t1.id=t0.jobid) LIMIT 10000;
>
> QUERY PLAN
>
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Limit  (cost=19486.41..57634.31 rows=10000 width=143) (actual
> time=1646.862..4869.705 rows=10000 loops=1)
>    ->  Hash Join  (cost=19486.41..1456197.07 rows=376616 width=143) (actual
> time=1646.859..4864.841 rows=10000 loops=1)
>          Hash Cond: (t0.jobid = t1.id)
>          ->  Bitmap Heap Scan on jobqueue t0  (cost=19401.16..1451681.36
> rows=690462 width=143) (actual time=1646.308..4096.874 rows=2768867
> loops=1)
>                Recheck Cond: (((status = 'H'::bpchar) AND (priorityset <
> 1415374856000::bigint)) OR ((status = 'P'::bpchar) AND (checkaction =
> 'R'::bpchar)) OR ((status = 'G'::bpchar) AND (checkaction = 'R'::bpchar)))
>                Filter: ((priorityset < 1415374856000::bigint) AND
> (checkaction = 'R'::bpchar))
>                ->  BitmapOr  (cost=19401.16..19401.16 rows=710369 width=0)
> (actual time=1456.979..1456.979 rows=0 loops=1)
>                      ->  Bitmap Index Scan on i1392985450173
> (cost=0.00..10.31 rows=1 width=0) (actual time=0.029..0.029 rows=0 loops=1)
>                            Index Cond: ((status = 'H'::bpchar) AND
> (priorityset < 1415374856000::bigint))
>                      ->  Bitmap Index Scan on i1392985450174
> (cost=0.00..5524.25 rows=207759 width=0) (actual time=349.283..349.283
> rows=1552934 loops=1)
>                            Index Cond: ((status = 'P'::bpchar) AND
> (checkaction = 'R'::bpchar))
>                      ->  Bitmap Index Scan on i1392985450174
> (cost=0.00..13348.76 rows=502610 width=0) (actual time=1107.663..1107.663
> rows=4266856 loops=1)
>                            Index Cond: ((status = 'G'::bpchar) AND
> (checkaction = 'R'::bpchar))
>          ->  Hash  (cost=85.10..85.10 rows=12 width=8) (actual
> time=0.074..0.074 rows=4 loops=1)
>                Buckets: 1024  Batches: 1  Memory Usage: 2kB
>                ->  Bitmap Heap Scan on jobs t1  (cost=46.64..85.10 rows=12
> width=8) (actual time=0.058..0.068 rows=4 loops=1)
>                      Recheck Cond: ((status = 'B'::bpchar) OR (status =
> 'b'::bpchar) OR (status = 'A'::bpchar) OR (status = 'a'::bpchar) OR (status
> = 'R'::bpchar) OR (status = 'r'::bpchar) OR (status = 'O'::bpchar) OR
> (status = 'o'::bpch
> ar) OR (status = 'U'::bpchar) OR (status = 'u'::bpchar))
>                      ->  BitmapOr  (cost=46.64..46.64 rows=13 width=0)
> (actual time=0.049..0.049 rows=0 loops=1)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..4.26 rows=1 width=0) (actual time=0.019..0.019 rows=0 loops=1)
>                                  Index Cond: (status = 'B'::bpchar)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..4.26 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1)
>                                  Index Cond: (status = 'b'::bpchar)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..8.28 rows=4 width=0) (actual time=0.008..0.008 rows=4 loops=1)
>                                  Index Cond: (status = 'A'::bpchar)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
>                                  Index Cond: (status = 'a'::bpchar)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..4.26 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1)
>                                  Index Cond: (status = 'R'::bpchar)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
>                                  Index Cond: (status = 'r'::bpchar)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..4.26 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
>                                  Index Cond: (status = 'O'::bpchar)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..4.26 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1)
>                                  Index Cond: (status = 'o'::bpchar)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
>                                  Index Cond: (status = 'U'::bpchar)
>                            ->  Bitmap Index Scan on i1392985450183
> (cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
>                                  Index Cond: (status = 'u'::bpchar)
>  Total runtime: 4872.335 ms
>

Re: (Continuous) crawl performance

Reply via email to