[PERFORM] Selectivity for lopsided foreign key columns

2015-12-17 Thread Mikkel Lauritsen

Hi all,

I have an application that runs in production in multiple instances, and 
on one of these the performance of certain queries suddenly became truly 
abysmal. I basically know why, but I would much appreciate if I could 
obtain a deeper understanding of the selectivity function involved and 
any possible means of making Postgres choose a better plan. In the 
following I have tried to boil the problem down to something manageable:


The schema contains two tables, t1 and t2.
t2 has two fields, an id and a tag, and it contains 146 rows that are 
unique.
t1 has two fields, a value and a foreign key referring to t2.id, and it 
contains 266177 rows.


The application retrieves the rows in t1 that match a specific tag in 
t2, and it turned out that the contents of t1 were distributed in a very 
lopsided way, where more than 90% of the rows refer to one of two tags 
from t2:


EXPLAIN SELECT(*) FROM t1 WHERE t2_id = ''

Index Scan using t1_t2_id_idx on t1  (cost=0.42..7039.67 rows=103521 
width=367)

  Index Cond: (t2_id = ''::text)

The row count estimate is exactly as expected; about 39% of the rows 
refer to that specific tag.


What the application actually does is

EXPLAIN SELECT(*) FROM t1 INNER JOIN t2 ON t1.t2_id = t2.id WHERE t2.tag 
= ''


Nested Loop (cost=0.69..3152.53 rows=1824 width=558)
  ->  Index Scan using t2_tag_idx ON t2  (cost=0.27..2.29 rows=1 
width=191)

Index Cond: (tag = ''::text)
  ->  Index Scan using t1_t2_id_idx on t1  (cost=0.42..3058.42 rows=9182 
width=367)

Index Cond: (t2_id = t2.id)

The estimate for the number of rows in the result (1824) is way too low, 
and that leads to bad plans and queries involving more joins on the 
tables that run about 1000x slower than they should.


I have currently rewritten the application code to do two queries; one 
to retrieve the id from t2 that matches the given tag and one to 
retrieve the rows from t1, and that's a usable workaround but not 
something we really like doing as a permanent solution. Fiddling with 
the various statistics related knobs seems to make no difference, but is 
there be some other way I can make Postgres assume high selectivity for 
certain tag values? Am I just SOL with the given schema?


Any pointers to information about how to handle potentially lopsided 
data like this are highly welcome.


Best regards,
  Mikkel Lauritsen


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Selectivity for lopsided foreign key columns

2015-12-17 Thread Mikkel Lauritsen

On 2015-12-17 16:23, Tom Lane wrote:

Mikkel Lauritsen <ren...@tala.dk> writes:

The schema contains two tables, t1 and t2.
t2 has two fields, an id and a tag, and it contains 146 rows that are
unique.
t1 has two fields, a value and a foreign key referring to t2.id, and 
it

contains 266177 rows.
The application retrieves the rows in t1 that match a specific tag in
t2, and it turned out that the contents of t1 were distributed in a 
very

lopsided way, where more than 90% of the rows refer to one of two tags
from t2:
...
The estimate for the number of rows in the result (1824) is way too 
low,

and that leads to bad plans and queries involving more joins on the
tables that run about 1000x slower than they should.



I have currently rewritten the application code to do two queries; one
to retrieve the id from t2 that matches the given tag and one to
retrieve the rows from t1, and that's a usable workaround but not
something we really like doing as a permanent solution. Fiddling with
the various statistics related knobs seems to make no difference, but 
is
there be some other way I can make Postgres assume high selectivity 
for

certain tag values? Am I just SOL with the given schema?


You're pretty much SOL.  Lacking cross-column statistics, the planner 
has

no idea which t2.id goes with the given tag, so it can't see that the
selected id is the one that is most common in t1.  You're getting a
join size estimate that is basically size of t1 divided by number of
possible values (146), which is about the best we can do without 
knowing

which id is selected.


--- snip --

Thanks - I thought as much, but it's really nice to have it confirmed 
from

people who are way more knowledgeable.

Best regards and thanks again,
  Mikkel Lauritsen


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Reasons for choosing one execution plan over another?

2013-09-12 Thread Mikkel Lauritsen
I wrote:

--- snip ---

 So - does anybody with enough insight in the planner know if it sounds
 likely that it would choose the given plans in these two cases, or if
 it's more likely that I have a tuning problem that leads to bad
 planning?

Duh. It suddenly dawned on me that I need to look closer at the plans...

The big difference in the estimated and actual row count in lines like

-  Nested Loop  (cost=0.00..250.78 rows=338 width=47) (actual
time=0.100..189.676 rows=187012 loops=1)

indicates that the planner is somehow mislead by the statistics on (at
least) one of the tables, right? Any suggestions as to how I go about
investigating that further?

One thing here that is slightly confusing is the relationship between
the estimated row count of 169 in the outer loop and 6059 in the last
index scan in the partial plan below. How do they relate to each other?

-  Nested Loop  (cost=0.00..452.10 rows=169 width=47) (actual
time=0.088..41.244 rows=32863 loops=1)
  -  Nested Loop  (cost=0.00..16.55 rows=1 width=39) (actual
time=0.031..0.035 rows=1 loops=1)
-  Index Scan using i_c_id on i  (cost=0.00..8.27 rows=1
width=39) (actual time=0.016..0.017 rows=1 loops=1)
  Index Cond: (c = 'bar'::text)
-  Index Scan using a_i_id_idx on a  (cost=0.00..8.27 rows=1
width=78) (actual time=0.012..0.013 rows=1 loops=1)
  Index Cond: (i_id = i.id)
  -  Index Scan using x_a_id_idx on x  (cost=0.00..374.95 rows=6059
width=86) (actual time=0.055..27.219 rows=32863 loops=1)
Index Cond: (a_id = a.id)


Best regards  thanks,
  Mikkel Lauritsen


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Reasons for choosing one execution plan over another?

2013-09-11 Thread Mikkel Lauritsen
Hi all,

I have a number of Postgres 9.2.4 databases with the same schema but with
slightly different contents, running on small servers that are basically
alike (8-16 GB ram).

When I run the same query on these databases it results in one of two
different execution plans where one is much faster (appx. 50 times) than
the other. Each database always gives the same plan, and vacuuming,
updating statistics and reindexing doesn't seem to make any difference.

Clearly the fast plan is preferred, but I haven't been able to identify
any pattern (table sizes, tuning etc.) in why one plan is chosen over the
other, so is there any way I can make Postgres tell me why it chooses to
plan the way it does?

As reference I have included the query and execution plans for two
different databases below. The x and e tables contain about 5 and 10
million records respectively, and the performance difference is perfectly
reasonable as the outer loop has to process 3576 rows in the fast case and
154149 rows in the slow case.

Best regards  thanks,
  Mikkel Lauritsen

---

Query:

SELECT x.r, e.id, a.id
FROM x
  INNER JOIN e ON x.id = e.id
  INNER JOIN a ON x.a_id = a.id
  INNER JOIN i ON a.i_id = i.id
WHERE e.h_id = 'foo' AND i.c = 'bar';

Fast plan:

 Nested Loop  (cost=0.00..24553.77 rows=1 width=86) (actual
time=2.810..102.451 rows=20 loops=1)
   Join Filter: (x.a_id = a.id)
   Rows Removed by Join Filter: 3556
   -  Nested Loop  (cost=0.00..16.55 rows=1 width=39) (actual
time=0.036..0.046 rows=3 loops=1)
 -  Index Scan using i_c_idx on i  (cost=0.00..8.27 rows=1
width=39) (actual time=0.019..0.020 rows=1 loops=1)
   Index Cond: (c = 'bar'::text)
 -  Index Scan using a_i_id_idx on a  (cost=0.00..8.27 rows=1
width=78) (actual time=0.014..0.021 rows=3 loops=1)
   Index Cond: (i_id = i.id)
   -  Nested Loop  (cost=0.00..24523.00 rows=1138 width=86) (actual
time=2.641..33.818 rows=1192 loops=3)
 -  Index Scan using e_h_id_idx on e  (cost=0.00..6171.55
rows=1525 width=39) (actual time=0.049..1.108 rows=1857 loops=3)
   Index Cond: (h_id = 'foo'::text)
 -  Index Scan using x_id_idx on x  (cost=0.00..12.02 rows=1
width=86) (actual time=0.017..0.017 rows=1 loops=5571)
   Index Cond: (id = e.id)
 Total runtime: 102.526 ms

Slow plan:

 Nested Loop  (cost=0.00..858.88 rows=1 width=86) (actual
time=89.430..2589.905 rows=11 loops=1)
   -  Nested Loop  (cost=0.00..448.38 rows=169 width=86) (actual
time=0.135..142.246 rows=154149 loops=1)
 -  Nested Loop  (cost=0.00..16.55 rows=1 width=39) (actual
time=0.056..0.064 rows=3 loops=1)
   -  Index Scan using i_c_idx on i  (cost=0.00..8.27 rows=1
width=39) (actual time=0.030..0.030 rows=1 loops=1)
 Index Cond: (c = 'bar'::text)
   -  Index Scan using a_i_id_idx on a  (cost=0.00..8.27
rows=1 width=78) (actual time=0.022..0.028 rows=3 loops=1)
 Index Cond: (i_id = i.id)
 -  Index Scan using x_a_id_idx on x  (cost=0.00..372.48
rows=5935 width=86) (actual time=0.065..35.479 rows=51383 loops=3)
   Index Cond: (a_id = a.id)
   -  Index Scan using e_pkey on e  (cost=0.00..2.42 rows=1 width=39)
(actual time=0.015..0.015 rows=0 loops=154149)
 Index Cond: (id = x.id)
 Filter: (h_id = 'foo'::text)
 Rows Removed by Filter: 1
 Total runtime: 2589.970 ms



-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Reasons for choosing one execution plan over another?

2013-09-11 Thread Mikkel Lauritsen
Hi all,

On Wed, 11 Sep 2013 18:55:38 +0200, Giuseppe Broccolo
giuseppe.brocc...@2ndquadrant.it wrote:
 Il 11/09/2013 13:16, Mikkel Lauritsen ha scritto:
  Hi all,
 
  I have a number of Postgres 9.2.4 databases with the same schema but
  with slightly different contents, running on small servers that are
  basically alike (8-16 GB ram).

 I think that your answer can be found in your statement slightly 
 different contents.

Yup, that's what I've been thinking myself so far - it definitely doesn't
look as if I'm hitting a bug, and I've been through the schema so I feel
reasonably sure that I'm not missing an index.

In the example from my original mail the i and a tables are identical in
the two databases. The slow plan is chosen when the x and e tables contain
3.2M and 6.2M rows, the fast plan has 12.8M and 17M rows.

So - does anybody with enough insight in the planner know if it sounds
likely that it would choose the given plans in these two cases, or if
it's more likely that I have a tuning problem that leads to bad planning?

And if the different plans are to be expected, is there any way I can hint
at the planner to make it choose the fast plan in both cases?

FWIW if I do a very simple query like

SELECT e.id FROM e INNER JOIN x USING (id) WHERE e.h_id = 'foo';

on the two databases I also end up with two different plans (included
below). Here the execution time difference is much less pronounced (note
that the fast execution is on inferior hardware and with a much larger
result), but the way the join is executed is the same as in the initial
larger plans. Setting seq_page_cost and random_page_cost to the same
value makes the planner choose the fast plan in both cases, but unfortu-
nately that has no effect on my initial problem :-/

Best regards  thanks,
  Mikkel Lauritsen

---

Fast plan:

 Nested Loop  (cost=0.00..24523.00 rows=1138 width=39) (actual
time=2.546..33.858 rows=1192 loops=1)
   Buffers: shared hit=8991
   -  Index Scan using e_h_id_idx on e  (cost=0.00..6171.55 rows=1525
width=39) (actual time=0.053..1.211 rows=1857 loops=1)
 Index Cond: (healthtrack_id =
'-95674114670403931535179954575983492851'::text)
 Buffers: shared hit=350
   -  Index Only Scan using x_pkey on x  (cost=0.00..12.02 rows=1
width=39) (actual time=0.017..0.017 rows=1 loops=1857)
 Index Cond: (id = e.id)
 Heap Fetches: 1192
 Buffers: shared hit=8641
 Total runtime: 34.065 ms


Slow plan:

 Nested Loop  (cost=22.25..7020.66 rows=277 width=39) (actual
time=0.298..13.996 rows=228 loops=1)
   Buffers: shared hit=3173
   -  Bitmap Heap Scan on e  (cost=22.25..2093.50 rows=537 width=39)
(actual time=0.219..0.628 rows=697 loops=1)
 Recheck Cond: (healthtrack_id = 'foo'::text)
 Buffers: shared hit=152
 -  Bitmap Index Scan on e_h_id_idx  (cost=0.00..22.12 rows=537
width=0) (actual time=0.188..0.188 rows=697 loops=1)
   Index Cond: (h_id = 'foo'::text)
   Buffers: shared hit=9
   -  Index Only Scan using x_pkey on x  (cost=0.00..9.17 rows=1
width=39) (actual time=0.018..0.018 rows=0 loops=697)
 Index Cond: (id = e.id)
 Heap Fetches: 228
 Buffers: shared hit=3021
 Total runtime: 14.070 ms


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Different execution plans for semantically equivalent queries

2011-02-06 Thread Mikkel Lauritsen
Hi all,

I have a execution planner related issue that I'd like to have some help
in understanding a bit deeper -

I have a table which basically contains fields for a value, a timestamp
and
a record type which is an integer. I would like to do a query which
retrieves
the newest record for each type, and the persistence framework that I'm
using
does something which is structurally like

SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE
t2.type = t1.type AND t2.timestamp  t1.timestamp)

On all of the PostgreSQL versions I've tried (9.0.2 included) this
executes
in about 20-30 seconds, which isn't exactly stellar. I've tried the (I
think)
equivalent

SELECT * FROM table t1 WHERE NOT EXISTS (SELECT * FROM table t2 WHERE
t2.type = t1.type AND t2.timestamp  t1.timestamp)

instead, and that executes in about 100 ms, so it's about 200 times
faster.

The two statements have completely different execution plans, so I
understand
why there is a difference in performance, but as I'm unable to modify the
SQL that the persistence framework generates I'd like to know if there's
anything that I can do in order to make the first query execute as fast as
the second one.

I'm more specifically thinking whether I'm missing out on a crucial
planner
configuration knob or something like that, which causes the planner to
treat the two cases differently.

Best regards  thanks for an excellent database engine,
  Mikkel Lauritsen


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Different execution plans for semantically equivalent queries

2011-02-06 Thread Mikkel Lauritsen
Hi Tom et al,

Many thanks for your prompt reply - you wrote:

 SELECT * FROM table t1 WHERE 0 = (SELECT COUNT(*) FROM table t2 WHERE
 t2.type = t1.type AND t2.timestamp  t1.timestamp)
 
 I suspect that *any* database is going to have trouble optimizing that.

Okay, I expected that much.

Just out of curiosity I've been looking a bit at the optimizer code
in PostgreSQL, and it seems as if it would be at least theoretically
possible to add support for things like transforming the query at
hand into the NOT EXISTS form; a bit like how = NULL is converted
to IS NULL.

Would a change like that be accepted, or would you rather try to
indirectly educate people into writing better SQL?

 You'd be well advised to lobby the persistence framework's authors to
 produce less brain-dead SQL.  The NOT EXISTS formulation seems to
 express what's wanted much less indirectly.

Will do :-)

For now I guess I'll hack it by wrapping a proxy around the JDBC
driver and rewriting the SQL on the fly; I encounter other bugs in
the persistence layer that are probably best handled that way as
well.

Best regards  thanks,
  Mikkel

-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance