Hi Aeham, The query plans look reasonable. The fact that all the document priorities need to be written accounts for the high postgresql I/O. About the only thing I can think to do is put the writes in a transaction, which may help performance somewhat. I'll experiment with that over the weekend.
bq. I was also wondering, and this is just me speculating at this point, whether the exact same set of docs would be re-processed every time another crawler agent node is started. That should not happen. There are two different kinds of restarts -- one where NO agents nodes are registered, and a second where there are already agents nodes registered. Reprioritization should not take place when the secondary nodes start up. However, reprioritization *does* take place whenever you start, stop, pause, or resume a job. I don't have a good solution for that issue yet. Karl On Fri, Nov 7, 2014 at 11:50 AM, Aeham Abushwashi < [email protected]> wrote: > With all crawler agents stopped, the current state of the system is as > follows (using a variant of the reprioritization query): > > SELECT jobid, status, COUNT(*) FROM jobqueue t0 WHERE (status='H' OR > status='P' OR status='G') AND priorityset<1415374856000 AND checkaction='R' > AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status='B' OR t1.status='b' > OR t1.status='A' OR t1.status='a' OR t1.status='R' OR t1.status='r' OR > t1.status='O' OR t1.status='o' OR t1.status='U' OR t1.status='u') AND > t1.id=t0.jobid) > GROUP BY jobid, status ORDER BY jobid; > jobid | status | count > ---------------+--------+--------- > 1413897188357 | P | 47281 > 1413897188357 | G | 2420206 > 1414075755977 | G | 43399 > 1415018782273 | G | 9 > 1415020466369 | G | 24 > > Note that all 4 crawls are in continuous mode and status Active. > > When the first agent node is started, it iterates through the above docs > and updates their priority. Starting with the following query: > > SELECT id,dochash,docid,jobid FROM jobqueue t0 WHERE (status='H' OR > status='P' OR status='G') AND priorityset< [some timestamp] AND > checkaction='R' AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status='B' OR > t1.status='b' OR t1.status='A' OR t1.status='a' OR t1.status='R' OR > t1.status='r' OR t1.status='O' OR t1.status='o' OR t1.status='U' OR > t1.status='u') AND t1.id=t0.jobid) LIMIT 10000 > > which takes ~4 seconds for each batch. The AgentsThread then proceeds to > update their priority one by one. This, the VisualVM profiler suggests, > adds another ~9 seconds to each iteration and hits postgresql (IO) really > hard. With ~2.5M docs to go through, the numbers gradually add up. > > I was also wondering, and this is just me speculating at this point, > whether the exact same set of docs would be re-processed every time another > crawler agent node is started. I say that because, the AgentsThread sets > priorityset to current time which presumably would lead any subsequent > starting node to hit the (moe or less) same set of documents as the > previous one did. > > > For reference, the execution plan of the SELECT QUERY is as follows: > > EXPLAIN ANALYZE SELECT id,dochash,docid,jobid FROM jobqueue t0 WHERE > (status='H' OR status='P' OR status='G') AND priorityset<1415374856000 AND > checkaction='R' AND EXISTS(SELECT 'x' FROM jobs t1 WHERE (t1.status='B' OR > t1.status='b' OR t1.status='A' OR t1.status='a' OR t1.status='R' OR > t1.status='r' OR t1.status='O' OR t1.status='o' OR t1.status='U' OR > t1.status='u') AND t1.id=t0.jobid) LIMIT 10000; > > QUERY PLAN > > > ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ > > Limit (cost=19486.41..57634.31 rows=10000 width=143) (actual > time=1646.862..4869.705 rows=10000 loops=1) > -> Hash Join (cost=19486.41..1456197.07 rows=376616 width=143) (actual > time=1646.859..4864.841 rows=10000 loops=1) > Hash Cond: (t0.jobid = t1.id) > -> Bitmap Heap Scan on jobqueue t0 (cost=19401.16..1451681.36 > rows=690462 width=143) (actual time=1646.308..4096.874 rows=2768867 > loops=1) > Recheck Cond: (((status = 'H'::bpchar) AND (priorityset < > 1415374856000::bigint)) OR ((status = 'P'::bpchar) AND (checkaction = > 'R'::bpchar)) OR ((status = 'G'::bpchar) AND (checkaction = 'R'::bpchar))) > Filter: ((priorityset < 1415374856000::bigint) AND > (checkaction = 'R'::bpchar)) > -> BitmapOr (cost=19401.16..19401.16 rows=710369 width=0) > (actual time=1456.979..1456.979 rows=0 loops=1) > -> Bitmap Index Scan on i1392985450173 > (cost=0.00..10.31 rows=1 width=0) (actual time=0.029..0.029 rows=0 loops=1) > Index Cond: ((status = 'H'::bpchar) AND > (priorityset < 1415374856000::bigint)) > -> Bitmap Index Scan on i1392985450174 > (cost=0.00..5524.25 rows=207759 width=0) (actual time=349.283..349.283 > rows=1552934 loops=1) > Index Cond: ((status = 'P'::bpchar) AND > (checkaction = 'R'::bpchar)) > -> Bitmap Index Scan on i1392985450174 > (cost=0.00..13348.76 rows=502610 width=0) (actual time=1107.663..1107.663 > rows=4266856 loops=1) > Index Cond: ((status = 'G'::bpchar) AND > (checkaction = 'R'::bpchar)) > -> Hash (cost=85.10..85.10 rows=12 width=8) (actual > time=0.074..0.074 rows=4 loops=1) > Buckets: 1024 Batches: 1 Memory Usage: 2kB > -> Bitmap Heap Scan on jobs t1 (cost=46.64..85.10 rows=12 > width=8) (actual time=0.058..0.068 rows=4 loops=1) > Recheck Cond: ((status = 'B'::bpchar) OR (status = > 'b'::bpchar) OR (status = 'A'::bpchar) OR (status = 'a'::bpchar) OR (status > = 'R'::bpchar) OR (status = 'r'::bpchar) OR (status = 'O'::bpchar) OR > (status = 'o'::bpch > ar) OR (status = 'U'::bpchar) OR (status = 'u'::bpchar)) > -> BitmapOr (cost=46.64..46.64 rows=13 width=0) > (actual time=0.049..0.049 rows=0 loops=1) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..4.26 rows=1 width=0) (actual time=0.019..0.019 rows=0 loops=1) > Index Cond: (status = 'B'::bpchar) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..4.26 rows=1 width=0) (actual time=0.003..0.003 rows=0 loops=1) > Index Cond: (status = 'b'::bpchar) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..8.28 rows=4 width=0) (actual time=0.008..0.008 rows=4 loops=1) > Index Cond: (status = 'A'::bpchar) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1) > Index Cond: (status = 'a'::bpchar) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..4.26 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1) > Index Cond: (status = 'R'::bpchar) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1) > Index Cond: (status = 'r'::bpchar) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..4.26 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1) > Index Cond: (status = 'O'::bpchar) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..4.26 rows=1 width=0) (actual time=0.001..0.001 rows=0 loops=1) > Index Cond: (status = 'o'::bpchar) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1) > Index Cond: (status = 'U'::bpchar) > -> Bitmap Index Scan on i1392985450183 > (cost=0.00..4.26 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1) > Index Cond: (status = 'u'::bpchar) > Total runtime: 4872.335 ms >
