[PERFORM] Size of IN list affects query plan

2013-11-08 Thread Jan Walter

Hi,

I would like to know how the size of the IN list affects the query planner.
I have this query:

select distinct on (event_id, tag_id) et.id,
   e.id as event_id, t.id as tag_id, t.name,
   t.user_id, t.shared, t.color,
   case
     when ea.id <> e.id then true
     else false
   end as inherited
from do_event e
 join do_event ea on (ea.tree_id = e.tree_id and ea.lft <= e.lft
and ea.rght >= e.rght)
 join do_event_tags et on (et.event_id = ea.id)
 join do_tag t on (t.id = et.tag_id)
where e.id in (LIST_OF_INTEGERS) and
  (t.user_id = 14 or t.shared)
order by event_id, tag_id, inherited;

and I suspect that the size of the list impacts the plan significantly.


If LIST_OF_INTEGERS has <= 233 values, the query is really fast:
 Unique  (cost=2351.85..2353.71 rows=249 width=33) (actual time=24.515..24.654 rows=163 loops=1)
   ->  Sort  (cost=2351.85..2352.47 rows=249 width=33) (actual time=24.513..24.549 rows=166 loops=1)
         Sort Key: e.id, t.id, (CASE WHEN (ea.id <> e.id) THEN true ELSE false END)
         Sort Method: quicksort  Memory: 37kB
         ->  Hash Join  (cost=2217.89..2341.94 rows=249 width=33) (actual time=18.987..24.329 rows=166 loops=1)
               Hash Cond: (et.event_id = ea.id)
               ->  Hash Join  (cost=4.73..119.62 rows=1612 width=29) (actual time=0.151..4.634 rows=2312 loops=1)
                     Hash Cond: (et.tag_id = t.id)
                     ->  Seq Scan on do_event_tags et  (cost=0.00..79.47 rows=5147 width=12) (actual time=0.006..1.531 rows=5147 loops=1)
                     ->  Hash  (cost=4.08..4.08 rows=52 width=21) (actual time=0.119..0.119 rows=49 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 3kB
                           ->  Seq Scan on do_tag t  (cost=0.00..4.08 rows=52 width=21) (actual time=0.019..0.097 rows=49 loops=1)
                                 Filter: ((user_id = 14) OR shared)
               ->  Hash  (cost=2157.26..2157.26 rows=4472 width=8) (actual time=18.782..18.782 rows=270 loops=1)
                     Buckets: 1024  Batches: 1  Memory Usage: 11kB
                     ->  Nested Loop  (cost=428.35..2157.26 rows=4472 width=8) (actual time=0.597..18.595 rows=270 loops=1)
                           Join Filter: ((ea.lft <= e.lft) AND (ea.rght >= e.rght))
                           ->  Bitmap Heap Scan on do_event e  (cost=428.35..926.22 rows=232 width=16) (actual time=0.568..0.895 rows=233 loops=1)
                                 Recheck Cond: (id = ANY ('{110364,110377,42337,1503,5490,106267,106607,108419,108836,108556,108744,108466,108467,106331,3717,105404,35179,3398,5675,5896,5888,5287,4679,4275,4042,1599,4041,3311,1588,1605,1607,1606,1604,1594,1850,110494,110041,107955,110373,110068,110114,109503,109925,108959,108964,109189,109598,109142,109304,109607,107902,106668,109121,109101,109056,4621,109031,2574,5092,1674,106452,108901,108849,108713,108783,108766,108386,108455,2560,108397,1538,2007,108000,108389,108336,108456,36796,28985,108003,108421,108399,4871,106884,6371,36026,108204,108022,107941,107967,107911,107928,47944,107010,106640,107037,106994,107011,55313,105862,106332,106498,5850,13369,106161,5859,28465,106385,106444,102751,106371,105131,2610,102753,4833,4936,4755,4699,105402,14087,4798,4942,36249,55513,75790,75789,4238,6370,5744,5745,5149,4731,42297,34841,31190,17339,31155,31242,17701,17642,31203,31218,31376,5856,5141,18154,27146,17590,17566,13692,4867,1842,6365,6354,5480,5481,4382,5893,6355,5907,5886,5826,5028,4665,5230,5482,5273,4181,5091,4869,4983,4968,4961,4905,4906,4036,1483,4284,4790,4348,4648,4655,4647,4656,3075,4596,2144,4274,4592,4506,4549,4595,4188,4548,4511,4333,4306,4291,4240,4268,4114,3665,3547,1563,2102,1514,3579,3607,3501,2834,2436,3069,1400,2359,3056,3173,2897,2837,2780,2137,1447,1280,421,412,2076,1200,1691,446,1444,399,374,444,419,449}'::integer[]))
                                 ->  Bitmap Index Scan on do_event_pkey  (cost=0.00..428.29 rows=232 width=0) (actual time=0.538..0.538 rows=233 loops=1)
                                       Index Cond: (id = ANY
Re: [PERFORM] Size of IN list affects query plan

2013-11-08 Thread Tom Lane
Jan Walter j...@commontongue.com writes:
 I would like to know, how does the size of the IN list affect query planner.

AFAICT, the reason the second plan is slow is the large number of checks
of the IN list.  The planner does account for the cost of that, but it's
drastically underestimating that cost relative to the cost of I/O for the
heap and index accesses.  I suppose that your test case is fully cached in
memory, which helps make the CPU costs more important than I/O costs.
If you think this is representative of your real workload, then you
need to decrease random_page_cost (and maybe seq_page_cost too) to make
the cost estimates correspond better to that reality.
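
Tom's suggestion can be tried per-session before touching postgresql.conf. A minimal sketch (the values below are common starting points for fully cached data, not recommendations from this thread):

```sql
-- Per-session experiment: bring random_page_cost down toward seq_page_cost,
-- since random heap/index reads served from cache cost about the same as
-- sequential ones, making CPU costs relatively more important.
SET random_page_cost = 1.1;   -- default is 4.0
SET seq_page_cost = 1.0;      -- default; lower both if even seq I/O is cached
-- Re-run EXPLAIN ANALYZE on the query here to compare plans, then persist
-- only if it helps, e.g.:
-- ALTER DATABASE mydb SET random_page_cost = 1.1;
```

SET affects only the current session, so the two plans can be compared safely before changing anything globally.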

regards, tom lane


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Size of IN list affects query plan

2013-11-08 Thread bricklen
On Fri, Nov 8, 2013 at 6:04 AM, Jan Walter j...@commontongue.com wrote:

 Hi,

 I would like to know, how does the size of the IN list affect query
 planner.
 I have a query

 select distinct on (event_id, tag_id) et.id,
    e.id as event_id, t.id as tag_id, t.name,
    t.user_id, t.shared, t.color,
    case
      when ea.id <> e.id then true
      else false
    end as inherited
 from do_event e
  join do_event ea on (ea.tree_id = e.tree_id and ea.lft <= e.lft and
 ea.rght >= e.rght)
  join do_event_tags et on (et.event_id = ea.id)
  join do_tag t on (t.id = et.tag_id)
 where e.id in (LIST_OF_INTEGERS) and
   (t.user_id = 14 or t.shared)
 order by event_id, tag_id, inherited;



Looking at your EXPLAIN ANALYZE plan I was immediately reminded of this
article
http://www.datadoghq.com/2013/08/100x-faster-postgres-performance-by-changing-1-line/,
where changing the array to a VALUES() clause was a huge win for them.
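
The article's rewrite could be sketched against this query roughly as follows (list shortened for illustration; `ids(id)` is just an alias chosen here):

```sql
-- Instead of:  where e.id in (110364, 110377, 42337, ...)
-- join against a VALUES list, which the planner treats as a small relation
-- with its own row estimate rather than a long "= ANY(array)" filter
-- evaluated per row:
select e.id
from do_event e
join (values (110364), (110377), (42337)) as ids(id) on e.id = ids.id;
```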


Re: [PERFORM] Performance bug in prepared statement binding in 9.2?

2013-11-08 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 The explanation is in
 http://archives.postgresql.org/message-id/20130910132133.GJ1024477%40alap2.anarazel.de
 
 The referenced commit introduced a planner feature. Funnily, you seem to
 have been the trigger for its introduction ;)

 Oh, crap, the "off the end of the index" optimization?

 It's the story of our lives: we can't optimize anything without
 deoptimizing something else.  Dammit.

I wonder if we could ameliorate this problem by making
get_actual_variable_range() use SnapshotDirty rather than either
SnapshotNow (as it does in released versions) or the active snapshot (as
it does in HEAD).  We discussed that idea in the SnapshotNow removal
thread, see eg
http://www.postgresql.org/message-id/CA+TgmoZ_q2KMkxZAoRxRHB7k1tOmjVjQgYt2JuA7=u7qzol...@mail.gmail.com
In that thread I claimed that a current MVCC snapshot was the most
appropriate thing, which it probably is; but the argument for it isn't so
strong that I think we should be willing to spend unbounded effort to get
that version of the column min/max rather than some other approximation.
The best objection to it that I can see is Robert's security concern about
leakage of uncommitted values --- but I don't think that holds a huge
amount of water either.  We already try to limit the visibility of the
regular elements of the histogram, why are these not-yet-committed values
significantly more of an issue?

regards, tom lane

