Re: [PERFORM] Slow query with indexed ORDER BY and LIMIT when using OR'd conditions
No, it doesn't. Read it again ... or read up on row comparisons, if you're unfamiliar with that notation. The above queries are exactly equivalent per spec. Wow, this is great. Thanks.
[PERFORM] Slow query with indexed ORDER BY and LIMIT when using OR'd conditions
Hi there, I am trying to optimize a simple query that returns first 100 rows that have been updated since a given timestamp (ordered by timestamp and id desc). If there are several rows with the same timestamp I need to a second condition, that states that I want to return rows having the given timestamp and id given id. The obvious query is SELECT * FROM register_uz_accounting_entities WHERE effective_on '2014-07-11' OR (effective_on = '2014-07-11' AND id 1459) ORDER BY effective_on, id LIMIT 100 With a composite index on (effective_on, id) Query plan Limit (cost=4613.70..4613.95 rows=100 width=1250) (actual time=0.122..0.130 rows=22 loops=1) Buffers: shared hit=28 - Sort (cost=4613.70..4617.33 rows=1453 width=1250) (actual time=0.120..0.122 rows=22 loops=1) Sort Key: effective_on, id Sort Method: quicksort Memory: 30kB Buffers: shared hit=28 - Bitmap Heap Scan on register_uz_accounting_entities (cost=35.42..4558.17 rows=1453 width=1250) (actual time=0.036..0.083 rows=22 loops=1) Recheck Cond: ((effective_on '2014-07-11'::date) OR ((effective_on = '2014-07-11'::date) AND (id 1459))) Buffers: shared hit=28 - BitmapOr (cost=35.42..35.42 rows=1453 width=0) (actual time=0.026..0.026 rows=0 loops=1) Buffers: shared hit=6 - Bitmap Index Scan on idx2 (cost=0.00..6.49 rows=275 width=0) (actual time=0.017..0.017 rows=15 loops=1) Index Cond: (effective_on '2014-07-11'::date) Buffers: shared hit=3 - Bitmap Index Scan on idx2 (cost=0.00..28.21 rows=1178 width=0) (actual time=0.007..0.007 rows=7 loops=1) Index Cond: ((effective_on = '2014-07-11'::date) AND (id 1459)) Buffers: shared hit=3 Total runtime: 0.204 ms Everything works as expected. However if I change the constraint to a timestamp/date more in the past (resulting in far more matching rows) the query slows down drastically. SELECT * FROM register_uz_accounting_entities WHERE effective_on '2014-06-11' OR (effective_on = '2014-06-11' AND id 1459) ORDER BY effective_on, id LIMIT 100 Limit (cost=0.42..649.46 rows=100 width=1250) (actual time=516.125..516.242 rows=100 loops=1) Buffers: shared hit=576201 - Index Scan using idx2 on register_uz_accounting_entities (cost=0.42..106006.95 rows=16333 width=1250) (actual time=516.122..516.229 rows=100 loops=1) Filter: ((effective_on '2014-06-11'::date) OR ((effective_on = '2014-06-11'::date) AND (id 1459))) Rows Removed by Filter: 797708 Buffers: shared hit=576201 Total runtime: 516.304 ms I've tried to optimize this query by pushing down the limit and order by's into explicit subselects. SELECT * FROM ( SELECT * FROM register_uz_accounting_entities WHERE effective_on '2014-06-11' ORDER BY effective_on, id LIMIT 100 ) t1 UNION SELECT * FROM ( SELECT * FROM register_uz_accounting_entities WHERE effective_on = '2014-06-11' AND id 1459 ORDER BY effective_on, id LIMIT 100 ) t2 ORDER BY effective_on, id LIMIT 100 -- query plan Limit (cost=684.29..684.54 rows=100 width=1250) (actual time=2.648..2.708 rows=100 loops=1) Buffers: shared hit=203 - Sort (cost=684.29..684.79 rows=200 width=1250) (actual time=2.646..2.672 rows=100 loops=1) Sort Key: register_uz_accounting_entities.effective_on, register_uz_accounting_entities.id Sort Method: quicksort Memory: 79kB Buffers: shared hit=203 - HashAggregate (cost=674.65..676.65 rows=200 width=1250) (actual time=1.738..1.971 rows=200 loops=1) Buffers: shared hit=203 - Append (cost=0.42..661.15 rows=200 width=1250) (actual time=0.054..0.601 rows=200 loops=1) Buffers: shared hit=203 - Limit (cost=0.42..338.62 rows=100 width=1250) (actual time=0.053..0.293 rows=100 loops=1) Buffers: shared hit=101 - Index Scan using idx2 on register_uz_accounting_entities (cost=0.42..22669.36 rows=6703 width=1250) (actual time=0.052..0.260 rows=100 loops=1) Index Cond: (effective_on '2014-06-11'::date) Buffers: shared hit=101 - Limit (cost=0.42..318.53 rows=100 width=1250) (actual time=0.037..0.228 rows=100 loops=1) Buffers: shared hit=102 - Index Scan using idx2 on register_uz_accounting_entities register_uz_accounting_entities_1 (cost=0.42..30888.88 rows=9710 width=1250) (actual time=0.036..0.187 rows=100 loops=1) Index Cond: ((effective_on = '2014-06-11'::date) AND (id 1459)) Buffers: shared hit=102
Re: [PERFORM] Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions
Thanks for the quick reply David! However I am still unsure how these two queries are not relationally equivalent. I am struggling to find a counterexample where the first and third query (in email, not in gist) would yield different results. Any ideas? Jano On Mon, Jul 21, 2014 at 11:31 PM, David G Johnston david.g.johns...@gmail.com wrote: johno wrote The question is... why is the query planner unable to make this optimization for the slow query? What am I missing? Short answer - your first and last queries are not relationally equivalent and the optimizer cannot change the behavior of the query which it is optimizing. i.e. you did not make an optimization but rather choose to reformulate the question so that it could be answered more easily while still providing an acceptable answer. The question main question is better phrased as: Give me 100 updated at t(0) but only that are subsequent to a given ID. If there are less than 100 such records give me enough additional rows having t t(0) so that the total number of rows returned is equal to 100. Both queries give the same answer but only due to the final LIMIT 100. They arrive there in different ways which necessitates generating different plans. At a basic level it is unable to push down LIMIT into a WHERE clause and it cannot add additional sub-queries that do not exist in the original plan - which includes adding a UNION node. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Slow-query-with-indexed-ORDER-BY-and-LIMIT-when-using-OR-d-conditions-tp5812282p5812285.html Sent from the PostgreSQL - performance mailing list archive at Nabble.com. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions
Oh, yes I do understand that if I remove the outer limit, the semantics of the query would change. However I am looking for the counterexample *with* the limit clauses. Maybe I just don't understand what relationally equivalent means, sorry about that. BTW this is to my understanding a very similar scenario to how partitioned tables work and push down limit and where conditions. Why is this not possible in this case? Jano On Mon, Jul 21, 2014 at 11:54 PM, David G Johnston david.g.johns...@gmail.com wrote: johno wrote Thanks for the quick reply David! However I am still unsure how these two queries are not relationally equivalent. I am struggling to find a counterexample where the first and third query (in email, not in gist) would yield different results. Any ideas? Remove the outer LIMIT 100 from both queries... The first query would return a maximal number of rows that meet the OR criteria while the second query would return at most 200 rows since both sub-queries would still have their own independent LIMIT 100 clauses. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Slow-query-with-indexed-ORDER-BY-and-LIMIT-when-using-OR-d-conditions-tp5812282p5812289.html Sent from the PostgreSQL - performance mailing list archive at Nabble.com. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Re: Slow query with indexed ORDER BY and LIMIT when using OR'd conditions
Try following my lead and bottom-post, please. Sorry for that. Anyway, the query has no clue that because of the final LIMIT 100 that the two different feeding queries are just going to happen to end up providing the same result. Maybe, in this particular instance, it is theoretically possible to make such a proof but generally that is not the case and so such an optimization has not made into the codebase even if it theoretically could be done (I'm not convinced it could but do not know enough to explain to someone else why I have that impression). I do not know enough to answer why this situation is any different from a similar partitioning scenario. An example showing exactly what a similar partitioning query looks like would help in this regard. If you are looking for considerably more insight into the planner workings and why it does or doesn't do something you will need to wait for others. I can, to a reasonable degree, deconstruct a pair of queries and either explain or guess as to why things are happening but that is mostly applied deductive reasoning and not because I have any particular insight into the codebase. Thanks again for your time. Let's just wait for someone else and see where this will end up going. Jano
Re: [PERFORM] Slow query with indexed ORDER BY and LIMIT when using OR'd conditions
On Tue, Jul 22, 2014 at 4:53 AM, Tom Lane t...@sss.pgh.pa.us wrote: johno jan.suc...@gmail.com writes: I am trying to optimize a simple query that returns first 100 rows that have been updated since a given timestamp (ordered by timestamp and id desc). If there are several rows with the same timestamp I need to a second condition, that states that I want to return rows having the given timestamp and id given id. The obvious query is SELECT * FROM register_uz_accounting_entities WHERE effective_on '2014-07-11' OR (effective_on = '2014-07-11' AND id 1459) ORDER BY effective_on, id LIMIT 100 A more readily optimizable query is SELECT * FROM register_uz_accounting_entities WHERE (effective_on, id) ('2014-07-11'::date, 1459) ORDER BY effective_on, id LIMIT 100 Yes, but that query has completely different semantics - I can't change that. This formulation allows the planner to match both the WHERE and ORDER BY clauses directly to the two-column index. Are both fields really used? I was under the impression that only the first column from index can be used when there is a range query. I've tried to optimize this query by pushing down the limit and order by's into explicit subselects. As noted earlier, that's unlikely to be an improvement, because on its face it specifies more computation. Postgres is not terribly bright about UNIONs, either. Despite the cost calculation in explain the actual query times are very different. I get consistent sub 50ms responses from the optimized one (union with pushing down the limits) and 500+ms for the plain one (when not using bitmap index scan). Is this possible optimization considered by query planner or do I have force it? Thanks again for your time and effort, I appreciate it. regards, tom lane