[PERFORM] Selectivity for lopsided foreign key columns
Hi all, I have an application that runs in production in multiple instances, and on one of these the performance of certain queries suddenly became truly abysmal. I basically know why, but I would much appreciate if I could obtain a deeper understanding of the selectivity function involved and any possible means of making Postgres choose a better plan. In the following I have tried to boil the problem down to something manageable: The schema contains two tables, t1 and t2. t2 has two fields, an id and a tag, and it contains 146 rows that are unique. t1 has two fields, a value and a foreign key referring to t2.id, and it contains 266177 rows. The application retrieves the rows in t1 that match a specific tag in t2, and it turned out that the contents of t1 were distributed in a very lopsided way, where more than 90% of the rows refer to one of two tags from t2: EXPLAIN SELECT(*) FROM t1 WHERE t2_id = '' Index Scan using t1_t2_id_idx on t1 (cost=0.42..7039.67 rows=103521 width=367) Index Cond: (t2_id = ''::text) The row count estimate is exactly as expected; about 39% of the rows refer to that specific tag. What the application actually does is EXPLAIN SELECT(*) FROM t1 INNER JOIN t2 ON t1.t2_id = t2.id WHERE t2.tag = '' Nested Loop (cost=0.69..3152.53 rows=1824 width=558) -> Index Scan using t2_tag_idx ON t2 (cost=0.27..2.29 rows=1 width=191) Index Cond: (tag = ''::text) -> Index Scan using t1_t2_id_idx on t1 (cost=0.42..3058.42 rows=9182 width=367) Index Cond: (t2_id = t2.id) The estimate for the number of rows in the result (1824) is way too low, and that leads to bad plans and queries involving more joins on the tables that run about 1000x slower than they should. I have currently rewritten the application code to do two queries; one to retrieve the id from t2 that matches the given tag and one to retrieve the rows from t1, and that's a usable workaround but not something we really like doing as a permanent solution. Fiddling with the various statistics related knobs seems to make no difference, but is there be some other way I can make Postgres assume high selectivity for certain tag values? Am I just SOL with the given schema? Any pointers to information about how to handle potentially lopsided data like this are highly welcome. Best regards, Mikkel Lauritsen -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Selectivity for lopsided foreign key columns
On 2015-12-17 16:23, Tom Lane wrote: Mikkel Lauritsen <ren...@tala.dk> writes: The schema contains two tables, t1 and t2. t2 has two fields, an id and a tag, and it contains 146 rows that are unique. t1 has two fields, a value and a foreign key referring to t2.id, and it contains 266177 rows. The application retrieves the rows in t1 that match a specific tag in t2, and it turned out that the contents of t1 were distributed in a very lopsided way, where more than 90% of the rows refer to one of two tags from t2: ... The estimate for the number of rows in the result (1824) is way too low, and that leads to bad plans and queries involving more joins on the tables that run about 1000x slower than they should. I have currently rewritten the application code to do two queries; one to retrieve the id from t2 that matches the given tag and one to retrieve the rows from t1, and that's a usable workaround but not something we really like doing as a permanent solution. Fiddling with the various statistics related knobs seems to make no difference, but is there be some other way I can make Postgres assume high selectivity for certain tag values? Am I just SOL with the given schema? You're pretty much SOL. Lacking cross-column statistics, the planner has no idea which t2.id goes with the given tag, so it can't see that the selected id is the one that is most common in t1. You're getting a join size estimate that is basically size of t1 divided by number of possible values (146), which is about the best we can do without knowing which id is selected. --- snip -- Thanks - I thought as much, but it's really nice to have it confirmed from people who are way more knowledgeable. Best regards and thanks again, Mikkel Lauritsen -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Reasons for choosing one execution plan over another?
I wrote: --- snip --- So - does anybody with enough insight in the planner know if it sounds likely that it would choose the given plans in these two cases, or if it's more likely that I have a tuning problem that leads to bad planning? Duh. It suddenly dawned on me that I need to look closer at the plans... The big difference in the estimated and actual row count in lines like - Nested Loop (cost=0.00..250.78 rows=338 width=47) (actual time=0.100..189.676 rows=187012 loops=1) indicates that the planner is somehow mislead by the statistics on (at least) one of the tables, right? Any suggestions as to how I go about investigating that further? One thing here that is slightly confusing is the relationship between the estimated row count of 169 in the outer loop and 6059 in the last index scan in the partial plan below. How do they relate to each other? - Nested Loop (cost=0.00..452.10 rows=169 width=47) (actual time=0.088..41.244 rows=32863 loops=1) - Nested Loop (cost=0.00..16.55 rows=1 width=39) (actual time=0.031..0.035 rows=1 loops=1) - Index Scan using i_c_id on i (cost=0.00..8.27 rows=1 width=39) (actual time=0.016..0.017 rows=1 loops=1) Index Cond: (c = 'bar'::text) - Index Scan using a_i_id_idx on a (cost=0.00..8.27 rows=1 width=78) (actual time=0.012..0.013 rows=1 loops=1) Index Cond: (i_id = i.id) - Index Scan using x_a_id_idx on x (cost=0.00..374.95 rows=6059 width=86) (actual time=0.055..27.219 rows=32863 loops=1) Index Cond: (a_id = a.id) Best regards thanks, Mikkel Lauritsen -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Reasons for choosing one execution plan over another?
Hi all, I have a number of Postgres 9.2.4 databases with the same schema but with slightly different contents, running on small servers that are basically alike (8-16 GB ram). When I run the same query on these databases it results in one of two different execution plans where one is much faster (appx. 50 times) than the other. Each database always gives the same plan, and vacuuming, updating statistics and reindexing doesn't seem to make any difference. Clearly the fast plan is preferred, but I haven't been able to identify any pattern (table sizes, tuning etc.) in why one plan is chosen over the other, so is there any way I can make Postgres tell me why it chooses to plan the way it does? As reference I have included the query and execution plans for two different databases below. The x and e tables contain about 5 and 10 million records respectively, and the performance difference is perfectly reasonable as the outer loop has to process 3576 rows in the fast case and 154149 rows in the slow case. Best regards thanks, Mikkel Lauritsen --- Query: SELECT x.r, e.id, a.id FROM x INNER JOIN e ON x.id = e.id INNER JOIN a ON x.a_id = a.id INNER JOIN i ON a.i_id = i.id WHERE e.h_id = 'foo' AND i.c = 'bar'; Fast plan: Nested Loop (cost=0.00..24553.77 rows=1 width=86) (actual time=2.810..102.451 rows=20 loops=1) Join Filter: (x.a_id = a.id) Rows Removed by Join Filter: 3556 - Nested Loop (cost=0.00..16.55 rows=1 width=39) (actual time=0.036..0.046 rows=3 loops=1) - Index Scan using i_c_idx on i (cost=0.00..8.27 rows=1 width=39) (actual time=0.019..0.020 rows=1 loops=1) Index Cond: (c = 'bar'::text) - Index Scan using a_i_id_idx on a (cost=0.00..8.27 rows=1 width=78) (actual time=0.014..0.021 rows=3 loops=1) Index Cond: (i_id = i.id) - Nested Loop (cost=0.00..24523.00 rows=1138 width=86) (actual time=2.641..33.818 rows=1192 loops=3) - Index Scan using e_h_id_idx on e (cost=0.00..6171.55 rows=1525 width=39) (actual time=0.049..1.108 rows=1857 loops=3) Index Cond: (h_id = 'foo'::text) - Index Scan using x_id_idx on x (cost=0.00..12.02 rows=1 width=86) (actual time=0.017..0.017 rows=1 loops=5571) Index Cond: (id = e.id) Total runtime: 102.526 ms Slow plan: Nested Loop (cost=0.00..858.88 rows=1 width=86) (actual time=89.430..2589.905 rows=11 loops=1) - Nested Loop (cost=0.00..448.38 rows=169 width=86) (actual time=0.135..142.246 rows=154149 loops=1) - Nested Loop (cost=0.00..16.55 rows=1 width=39) (actual time=0.056..0.064 rows=3 loops=1) - Index Scan using i_c_idx on i (cost=0.00..8.27 rows=1 width=39) (actual time=0.030..0.030 rows=1 loops=1) Index Cond: (c = 'bar'::text) - Index Scan using a_i_id_idx on a (cost=0.00..8.27 rows=1 width=78) (actual time=0.022..0.028 rows=3 loops=1) Index Cond: (i_id = i.id) - Index Scan using x_a_id_idx on x (cost=0.00..372.48 rows=5935 width=86) (actual time=0.065..35.479 rows=51383 loops=3) Index Cond: (a_id = a.id) - Index Scan using e_pkey on e (cost=0.00..2.42 rows=1 width=39) (actual time=0.015..0.015 rows=0 loops=154149) Index Cond: (id = x.id) Filter: (h_id = 'foo'::text) Rows Removed by Filter: 1 Total runtime: 2589.970 ms -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Reasons for choosing one execution plan over another?
Hi all, On Wed, 11 Sep 2013 18:55:38 +0200, Giuseppe Broccolo giuseppe.brocc...@2ndquadrant.it wrote: Il 11/09/2013 13:16, Mikkel Lauritsen ha scritto: Hi all, I have a number of Postgres 9.2.4 databases with the same schema but with slightly different contents, running on small servers that are basically alike (8-16 GB ram). I think that your answer can be found in your statement slightly different contents. Yup, that's what I've been thinking myself so far - it definitely doesn't look as if I'm hitting a bug, and I've been through the schema so I feel reasonably sure that I'm not missing an index. In the example from my original mail the i and a tables are identical in the two databases. The slow plan is chosen when the x and e tables contain 3.2M and 6.2M rows, the fast plan has 12.8M and 17M rows. So - does anybody with enough insight in the planner know if it sounds likely that it would choose the given plans in these two cases, or if it's more likely that I have a tuning problem that leads to bad planning? And if the different plans are to be expected, is there any way I can hint at the planner to make it choose the fast plan in both cases? FWIW if I do a very simple query like SELECT e.id FROM e INNER JOIN x USING (id) WHERE e.h_id = 'foo'; on the two databases I also end up with two different plans (included below). Here the execution time difference is much less pronounced (note that the fast execution is on inferior hardware and with a much larger result), but the way the join is executed is the same as in the initial larger plans. Setting seq_page_cost and random_page_cost to the same value makes the planner choose the fast plan in both cases, but unfortu- nately that has no effect on my initial problem :-/ Best regards thanks, Mikkel Lauritsen --- Fast plan: Nested Loop (cost=0.00..24523.00 rows=1138 width=39) (actual time=2.546..33.858 rows=1192 loops=1) Buffers: shared hit=8991 - Index Scan using e_h_id_idx on e (cost=0.00..6171.55 rows=1525 width=39) (actual time=0.053..1.211 rows=1857 loops=1) Index Cond: (healthtrack_id = '-95674114670403931535179954575983492851'::text) Buffers: shared hit=350 - Index Only Scan using x_pkey on x (cost=0.00..12.02 rows=1 width=39) (actual time=0.017..0.017 rows=1 loops=1857) Index Cond: (id = e.id) Heap Fetches: 1192 Buffers: shared hit=8641 Total runtime: 34.065 ms Slow plan: Nested Loop (cost=22.25..7020.66 rows=277 width=39) (actual time=0.298..13.996 rows=228 loops=1) Buffers: shared hit=3173 - Bitmap Heap Scan on e (cost=22.25..2093.50 rows=537 width=39) (actual time=0.219..0.628 rows=697 loops=1) Recheck Cond: (healthtrack_id = 'foo'::text) Buffers: shared hit=152 - Bitmap Index Scan on e_h_id_idx (cost=0.00..22.12 rows=537 width=0) (actual time=0.188..0.188 rows=697 loops=1) Index Cond: (h_id = 'foo'::text) Buffers: shared hit=9 - Index Only Scan using x_pkey on x (cost=0.00..9.17 rows=1 width=39) (actual time=0.018..0.018 rows=0 loops=697) Index Cond: (id = e.id) Heap Fetches: 228 Buffers: shared hit=3021 Total runtime: 14.070 ms -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] Different execution plans for semantically equivalent queries
Hi all, I have a execution planner related issue that I'd like to have some help in understanding a bit deeper - I have a table which basically contains fields for a value, a timestamp and a record type which is an integer. I would like to do a query which retrieves the newest record for each type, and the persistence framework that I'm using does something which is structurally like SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE t2.type = t1.type AND t2.timestamp t1.timestamp) On all of the PostgreSQL versions I've tried (9.0.2 included) this executes in about 20-30 seconds, which isn't exactly stellar. I've tried the (I think) equivalent SELECT * FROM table t1 WHERE NOT EXISTS (SELECT * FROM table t2 WHERE t2.type = t1.type AND t2.timestamp t1.timestamp) instead, and that executes in about 100 ms, so it's about 200 times faster. The two statements have completely different execution plans, so I understand why there is a difference in performance, but as I'm unable to modify the SQL that the persistence framework generates I'd like to know if there's anything that I can do in order to make the first query execute as fast as the second one. I'm more specifically thinking whether I'm missing out on a crucial planner configuration knob or something like that, which causes the planner to treat the two cases differently. Best regards thanks for an excellent database engine, Mikkel Lauritsen -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Different execution plans for semantically equivalent queries
Hi Tom et al, Many thanks for your prompt reply - you wrote: SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE t2.type = t1.type AND t2.timestamp t1.timestamp) I suspect that *any* database is going to have trouble optimizing that. Okay, I expected that much. Just out of curiosity I've been looking a bit at the optimizer code in PostgreSQL, and it seems as if it would be at least theoretically possible to add support for things like transforming the query at hand into the NOT EXISTS form; a bit like how = NULL is converted to IS NULL. Would a change like that be accepted, or would you rather try to indirectly educate people into writing better SQL? You'd be well advised to lobby the persistence framework's authors to produce less brain-dead SQL. The NOT EXISTS formulation seems to express what's wanted much less indirectly. Will do :-) For now I guess I'll hack it by wrapping a proxy around the JDBC driver and rewriting the SQL on the fly; I encounter other bugs in the persistence layer that are probably best handled that way as well. Best regards thanks, Mikkel -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance