Re: [PERFORM] Merge joins on index scans

2016-03-13 Thread Tom Lane
James Parks  writes:
> On Mon, Feb 29, 2016 at 5:22 PM, Tom Lane  wrote:
>> The other explain shows a scan of "a" reading about 490k rows and
>> returning 395 of them, so there's a factor of about 200 re-read here.
>> I wonder if the planner should have inserted a materialize node to
>> reduce that.
>> 
>> However, I think the real problem is upstream of that: if that indexscan
>> was estimated at 26163.20 units, how'd the mergejoin above it get costed
>> at only 7850.13 units?  The answer has to be that the planner thought the
>> merge would stop before reading most of "a", as a result of limited range
>> of b.a_id.  It would be interesting to look into what the actual maximum
>> b.a_id value is.

> I've attached a pg_dump of a database that contains all of the data, in the
> event you (or others) would like to look at it. The attachment is ~1.8MB
> (gzipped), and you can replay the pg_dump file on a database that has just
> been created with initdb.

Thanks for sending the test data --- I got a chance to look at this
finally.  It looks like my first suspicion was right and the second one
wrong: the planner is badly underestimating the amount of rescan required
and thus not putting in a Materialize buffer node where needed.

If I force a Materialize node to be put in, without changing any cost
estimates (basically, set path->materialize_inner = 1 at the end of
final_cost_mergejoin), then I get this:

 Sort  (cost=7343.28..7343.62 rows=137 width=16) (actual time=218.342..232.817 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   ->  Merge Join  (cost=696.93..7338.41 rows=137 width=16) (actual time=18.627..136.549 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         ->  Index Scan using a_idx on b  (cost=0.29..2708.59 rows=83427 width=16) (actual time=0.048..28.876 rows=83427 loops=1)
         ->  Materialize  (cost=0.42..24368.01 rows=805 width=8) (actual time=18.568..74.447 rows=83658 loops=1)
               ->  Index Scan using a_pkey on a  (cost=0.42..24366.00 rows=805 width=8) (actual time=18.560..66.186 rows=238 loops=1)
                     Filter: (nonce = 64)
                     Rows Removed by Filter: 87955
 Execution time: 239.195 ms

which is a pretty satisfactory result in terms of runtime, though
still a bit slower than the hash alternative.

Now, there are 490166 "a" rows, but we can see that the inner indexscan
stopped after reading 87955+238 = 88193 of them.  So there really is an
early-stop effect and the planner seems to have gotten that about right.
The problem is the rescan effect, which we can now quantify at 83658/238
or about 350:1.  Despite which, stepping through final_cost_mergejoin
finds that it estimates *zero* rescan effect.  If I force the
rescannedtuples estimate to be the correct value of 83658 - 238 = 83420,
it properly decides that a materialize node ought to be put in; but that
also increases its overall cost estimate for this mergejoin to 7600, which
causes it to prefer doing the join in the other direction:

 Sort  (cost=7343.28..7343.62 rows=137 width=16) (actual time=169.579..184.184 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   ->  Merge Join  (cost=696.93..7338.41 rows=137 width=16) (actual time=16.460..108.562 rows=83427 loops=1)
         Merge Cond: (a.id = b.a_id)
         ->  Index Scan using a_pkey on a  (cost=0.42..24366.00 rows=805 width=8) (actual time=16.442..63.197 rows=238 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 87955
         ->  Index Scan using a_idx on b  (cost=0.29..2708.59 rows=83427 width=16) (actual time=0.013..26.811 rows=83427 loops=1)
 Execution time: 190.441 ms

which is an even more satisfactory outcome; the cheaper merge direction
is properly estimated as cheaper, and this is actually a tad faster than
the hash join for me.
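
(For anyone re-deriving the numbers: the 350:1 ratio and the 83420 figure used
above come straight from the first plan -- a trivial cross-check, written here
as SQL arithmetic:)

-- Materialize emitted 83658 inner tuples while the indexscan beneath it
-- produced only 238 rows, hence (integer division):
SELECT 83658 / 238 AS approx_rescan_factor,   -- 351, i.e. "about 350:1"
       83658 - 238 AS rescanned_tuples;       -- 83420, the value forced above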

So the entire problem here is a bogus rescannedtuples estimate.  The code
comment about that estimate is

 * When there are equal merge keys in the outer relation, the mergejoin
 * must rescan any matching tuples in the inner relation. This means
 * re-fetching inner tuples; we have to estimate how often that happens.
 *
 * For regular inner and outer joins, the number of re-fetches can be
 * estimated approximately as size of merge join output minus size of
 * inner relation. Assume that the distinct key values are 1, 2, ..., and
 * denote the number of values of each key in the outer relation as m1,
 * m2, ...; in the inner relation, n1, n2, ...  Then we have
 *
 * size of join = m1 * n1 + m2 * n2 + ...
 *
 * number of rescanned tuples = (m1 - 1) * n1 + (m2 - 1) * n2 + ... = m1 *
 * n1 + m2 * n2 + ... - (n1 + n2 + ...) = size of join - size of inner
 * relation
 *
 * This equation works correctly for outer tuples having no inner match
 * (nk = 0), but not for inner 

Re: [PERFORM] Merge joins on index scans

2016-03-01 Thread James Parks
On Sun, Feb 28, 2016 at 2:06 AM, David Rowley  wrote:

> On 27 February 2016 at 11:07, James Parks  wrote:
> >
> > CREATE TABLE a (id bigint primary key, nonce bigint);
> > CREATE TABLE b (id bigint primary key, a_id bigint not null);
> > CREATE INDEX a_idx ON b (a_id);
> >
> > The query:
> >
> > SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = ? ORDER BY b.id
> > ASC;
> >
> > (skip down to [1] and [2] to see the query performance)
> >
> > What I know:
> >
> > If you force the query planner to use a merge join on the above query, it
> > takes 10+ minutes to complete using the data as per below. If you force
> > the
> > query planner to use a hash join on the same data, it takes ~200
> > milliseconds.
>
> I believe I know what is going on here, but can you please test;
>
> SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a WHERE b.a_id = a.id AND
> a.nonce = ?) ORDER BY b.id ASC;
>
> using the merge join plan.
>
>
Here's the query plan for that query (slight modifications to get it to
run):

postgres=# explain (analyze,buffers) SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a WHERE b.a_id = a.id AND a.nonce = 64) ORDER BY b.id ASC;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=16639.19..16855.62 rows=86572 width=16) (actual time=145.117..173.298 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2112kB
   Buffers: shared hit=86193 read=881, temp read=269 written=269
   I/O Timings: read=3.199
   ->  Merge Semi Join  (cost=795.82..8059.09 rows=86572 width=16) (actual time=6.680..91.862 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=86193 read=881
         I/O Timings: read=3.199
         ->  Index Scan using a_idx on b  (cost=0.00..3036.70 rows=86572 width=16) (actual time=0.005..25.193 rows=83427 loops=1)
               Buffers: shared hit=1064 read=374
               I/O Timings: read=1.549
         ->  Index Scan using a_pkey on a  (cost=0.00..26259.85 rows=891 width=8) (actual time=6.663..35.177 rows=237 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 87939
               Buffers: shared hit=85129 read=507
               I/O Timings: read=1.650
 Total runtime: 186.825 ms

... so, yes, it does indeed get a lot faster


> If this performs much better then the problem is due to the merge join
> mark/restore causing the join to have to transition through many
> tuples which don't match the a.nonce = ? predicate. The mark and
> restore is not required for the rewritten query, as this uses a semi
> join rather than a regular inner join. With the semi join the executor
> knows that it's only meant to be matching a single tuple in "a", so
> once the first match is found it can move to the next row in the outer
> relation without having to restore the scan back to where it started
> matching that inner row again.
>
> If I'm right, to get around the problem you could: create index on a
> (nonce, id);
>
>
postgres=# CREATE INDEX a_id_nonce_idx ON a (nonce, id);
CREATE INDEX
postgres=# explain (analyze,buffers) SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = 64 ORDER BY b.id ASC;
                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=3376.57..3376.92 rows=140 width=16) (actual time=144.340..160.875 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   Buffers: shared hit=368 read=522, temp read=266 written=266
   ->  Merge Join  (cost=69.50..3371.58 rows=140 width=16) (actual time=0.056..88.409 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=368 read=522
         ->  Index Scan using a_idx on b  (cost=0.29..2724.58 rows=83427 width=16) (actual time=0.009..23.834 rows=83427 loops=1)
               Buffers: shared hit=171 read=518
         ->  Index Only Scan using a_id_nonce_idx on a  (cost=0.42..2450.58 rows=820 width=8) (actual time=0.041..20.776 rows=83658 loops=1)
               Index Cond: (nonce = 64)
               Heap Fetches: 83658
               Buffers: shared hit=197 read=4
 Planning time: 0.241 ms
 Execution time: 172.346 ms

Looks pretty fast to me. That being said, the number of rows returned by
the Index Only Scan seems a bit high, as compared to the results below, so
I added your patch below and got [2].


> If such an index is out of the question then a patch has been
> submitted for review which should fix this problem in (hopefully)
> either 9.6 or 9.7
> https://commitfest.postgresql.org/9/129/
> If you have a test environment handy, it would be nice if you could
> test the patch on the current git head to see if this fixes 

Re: [PERFORM] Merge joins on index scans

2016-02-29 Thread Tom Lane
David Rowley  writes:
> On 27 February 2016 at 11:07, James Parks  wrote:
>> If you force the query planner to use a merge join on the above query, it
>> takes 10+ minutes to complete using the data as per below. If you force the
>> query planner to use a hash join on the same data, it takes ~200
>> milliseconds.

> I believe I know what is going on here, but can you please test;
> SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a WHERE b.a_id = a.id AND
> a.nonce = ?) ORDER BY b.id ASC;
> using the merge join plan.

> If this performs much better then the problem is due to the merge join
> mark/restore causing the join to have to transition through many
> tuples which don't match the a.nonce = ? predicate.

Clearly we are rescanning an awful lot of the "a" table:

 ->  Index Scan using a_pkey on a  (cost=0.00..26163.20 rows=843 width=8) (actual time=5.706..751385.306 rows=83658 loops=1)
       Filter: (nonce = 64)
       Rows Removed by Filter: 2201063696
       Buffers: shared hit=2151024418 read=340
       I/O Timings: read=1.015

The other explain shows a scan of "a" reading about 490k rows and
returning 395 of them, so there's a factor of about 200 re-read here.
I wonder if the planner should have inserted a materialize node to
reduce that.

However, I think the real problem is upstream of that: if that indexscan
was estimated at 26163.20 units, how'd the mergejoin above it get costed
at only 7850.13 units?  The answer has to be that the planner thought the
merge would stop before reading most of "a", as a result of limited range
of b.a_id.  It would be interesting to look into what the actual maximum
b.a_id value is.
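
(A quick way to check that, sketched against the schema from the original post:)

-- If max(b.a_id) stops well short of max(a.id), the merge join can
-- legitimately stop scanning "a" early, which is what the low cost assumes.
SELECT max(a_id) AS max_b_a_id FROM b;
SELECT max(id)   AS max_a_id   FROM a;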

regards, tom lane




Re: [PERFORM] Merge joins on index scans

2016-02-28 Thread David Rowley
On 27 February 2016 at 11:07, James Parks  wrote:
>
> CREATE TABLE a (id bigint primary key, nonce bigint);
> CREATE TABLE b (id bigint primary key, a_id bigint not null);
> CREATE INDEX a_idx ON b (a_id);
>
> The query:
>
> SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = ? ORDER BY b.id
> ASC;
>
> (skip down to [1] and [2] to see the query performance)
>
> What I know:
>
> If you force the query planner to use a merge join on the above query, it
> takes 10+ minutes to complete using the data as per below. If you force the
> query planner to use a hash join on the same data, it takes ~200
> milliseconds.

I believe I know what is going on here, but can you please test;

SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a WHERE b.a_id = a.id AND
a.nonce = ?) ORDER BY b.id ASC;

using the merge join plan.

If this performs much better then the problem is due to the merge join
mark/restore causing the join to have to transition through many
tuples which don't match the a.nonce = ? predicate. The mark and
restore is not required for the rewritten query, as this uses a semi
join rather than a regular inner join. With the semi join the executor
knows that it's only meant to be matching a single tuple in "a", so
once the first match is found it can move to the next row in the outer
relation without having to restore the scan back to where it started
matching that inner row again.

If I'm right, to get around the problem you could: create index on a
(nonce, id);

If such an index is out of the question, then a patch has been
submitted for review which should fix this problem in (hopefully)
either 9.6 or 9.7:
https://commitfest.postgresql.org/9/129/
If you have a test environment handy, it would be nice if you could
test the patch on the current git head to see if this fixes your
problem. The findings would be quite interesting for me. Please note
this patch is for test environments only at this stage, not for
production use.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




[PERFORM] Merge joins on index scans

2016-02-26 Thread James Parks
Dear psql-performance,

I'm having issues with a certain query, and I was hoping you could help me
out.

The schema:

(Start with a new database cluster, with either SQL_ASCII or en_US.UTF-8
encoding, using the default server configuration available in the pgdg
Jessie packages.)

CREATE TABLE a (id bigint primary key, nonce bigint);
CREATE TABLE b (id bigint primary key, a_id bigint not null);
CREATE INDEX a_idx ON b (a_id);

The query:

SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = ? ORDER BY b.id
ASC;

(skip down to [1] and [2] to see the query performance)

What I know:

If you force the query planner to use a merge join on the above query, it
takes 10+ minutes to complete using the data as per below. If you force the
query planner to use a hash join on the same data, it takes ~200
milliseconds. This behavior is the same both on Postgresql 9.2.15 and
Postgresql 9.4.6 (at least as provided by the Debian Jessie repo hosted by
postgresql.org), and happens both on i386 and x86_64 platforms. Note: the
data for these queries is the same as described below. Let me know if you
want me to provide the raw csv's or similar.
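
(For concreteness, "forcing" a plan here just means disabling the competing
join methods for the session -- a sketch, assuming otherwise-default settings;
the enable_* GUCs only discourage a node type, they don't guarantee a plan:)

-- Get the merge join plan (as in [1]):
SET enable_hashjoin = off;
SET enable_nestloop = off;
EXPLAIN (ANALYZE, BUFFERS)
  SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = 64 ORDER BY b.id ASC;

-- Get the hash join plan (as in [2]):
RESET enable_hashjoin;
SET enable_mergejoin = off;
EXPLAIN (ANALYZE, BUFFERS)
  SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = 64 ORDER BY b.id ASC;
RESET ALL;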

Creating a new Postgresql 9.4 cluster (64-bit), creating the tables (a) and
(b), importing the table data into the tables (a) and (b), and then running
the above query using EXPLAIN results in a merge join query plan, as in [1].

Creating a new Postgresql 9.2 cluster (32-bit or 64-bit), creating the
tables (a) and (b), importing the table data into (a) and (b), and then
running the above query results in a hash join query plan, as in [2].

When running query [1], the postgres process on the machine consumes 100%
CPU for a long time (it seems CPU-bound).

What I expected:

I expected both of the hash join and merge join implementations of this
query to have comparable query times; perhaps within an order of magnitude.
This was expected on my part mostly because the cost metrics for each query
were very similar. Instead, the "optimal" query plan for the query takes
more than 1000x longer.

I also expected that the "Rows Removed by Filter: " for the index scan on
(a) would not have such a high count, as the number of rows in table (a)
(~500,000) is significantly less than the count (2,201,063,696).

What I want to know:

- Is this expected behavior? Can you describe how the merge join algorithm
achieves these results?
- Can I avoid this issue by disabling merge joins in the server
configuration?
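
(For reference, the session-level and cluster-wide ways of doing the latter --
just a sketch; whether it's actually a good idea is part of what I'm asking:)

-- Per session:
SET enable_mergejoin = off;

-- Cluster-wide on 9.4+, without hand-editing postgresql.conf:
ALTER SYSTEM SET enable_mergejoin = off;
SELECT pg_reload_conf();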

Configuration:

The configuration of the database is the sample configuration as per the
Debian Jessie packages of Postgresql available at
http://apt.postgresql.org/pub/repos/apt/ with the exception that the data
directory was explicitly specified.

Information about the data:

Here are some queries that help describe the data I'm working with:

postgres=# select distinct a_id, count(*) from b group by a_id;
  a_id  | count
--------+-------
  49872 |   320
  47994 |     5
  19084 | 82977
  53251 |   100
 109804 |    10
  51738 |     5
  49077 |    10

postgres=# select count(*) from b;
 count
-------
 83427

postgres=# select count(distinct nonce) from a;
 count
-------
   198

postgres=# select count(*) from a;
 count
--------
 490166

postgres=# select count(*) from a where nonce = 64;
 count
-------
   395
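
One more number that turns out to matter: how many rows of (b) actually survive
the join. Per plan [1] (actual rows=83427 at the Merge Join, and (b) has 83427
rows in total), the answer is every row of (b), even though only 395 rows of
(a) have nonce = 64. A query to confirm that directly:

SELECT count(*)
FROM b
JOIN a ON b.a_id = a.id
WHERE a.nonce = 64;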

Hardware:

2015-era Intel Xeon processors
More than 300 GB of RAM (greater than the size of the database by a large margin)
Database on a hardware RAID 1 array of 2 SSDs

Commentary:

This is my first bug report to a major open source project, so I apologize
in advance if I messed up this report. Let me know if I have left out key
details -- I'm happy to provide them.

Given that there are roughly 500k rows in table (a), the fact that the EXPLAIN
output claims the filter (nonce = 64) caused over 2 billion rows to be skipped
(refer to [1]) suggests that each row in table (b) is being compared to a
non-negligible number of rows in (a) -- roughly 2,201,063,696 / 83,427, or
about 26,000 filtered rows of (a) per row of (b) on average.

I can probably make this data available as a pg_dump file. Let me know if
you think that's necessary, and where I should upload it.

Regards,
James

[1]
postgres=# explain (analyze,buffers) select b.* from b join a on b.a_id = a.id where a.nonce = 64 order by b.id asc;
                                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7855.25..7855.61 rows=143 width=16) (actual time=752058.415..752080.731 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2096kB
   Buffers: shared hit=2151025721 read=479, temp read=267 written=267
   I/O Timings: read=2.384
   ->  Merge Join  (cost=869.07..7850.13 rows=143 width=16) (actual time=5.718..751760.637 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=2151025721 read=479
         I/O Timings: read=2.384
         ->  Index Scan using a_idx on b  (cost=0.00..2953.35 rows=83427 width=16) (actual time=0.007..68.165