Hi,

I did some initial performance evaluation of this patch today, and I see a clear improvement on larger joins. The scenario I've chosen for the experiments is a simple fact-dimension join, i.e. a small table referenced by a large table. So effectively something like this:

CREATE TABLE dim  (id INT PRIMARY KEY, ...);
CREATE TABLE fact (dim_id INT REFERENCES dim(id), ...);

except that I haven't used the foreign key constraint. In all the experiments I've used a fact table 10x the size of the dimension, but I believe what matters most is the size of the dimension (and whether the hash table fits into the L2/L3 cache).
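
For reference, a data set of this shape can be generated e.g. like this (just a rough sketch - the payload column and the md5() filler are made up, and the row counts need to be adjusted for each test; see the attached unijoins-test.sql for the actual setup):

    CREATE TABLE dim  (id INT PRIMARY KEY, val TEXT);
    CREATE TABLE fact (dim_id INT, val TEXT);

    -- e.g. dim 1M rows, fact 10M rows (the 10:1 ratio used in all tests)
    INSERT INTO dim
    SELECT i, md5(i::text) FROM generate_series(1, 1000000) s(i);

    INSERT INTO fact
    SELECT (i % 1000000) + 1, md5(i::text)
      FROM generate_series(1, 10000000) s(i);

    VACUUM ANALYZE dim;
    VACUUM ANALYZE fact;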

The query tested is very simple:

    select count(1) from (
        select * from fact f join dim d on (f.dim_id = d.id)
    ) foo;

The outer aggregation is intentional - the join produces many rows, and formatting them as strings would completely eliminate any gains from the patch (even with "\o /dev/null" or similar).
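
To illustrate the difference (a rough psql sketch, not the exact way the runs were scripted):

    \timing on

    -- measured: the join output is consumed by the aggregate
    select count(1) from (
        select * from fact f join dim d on (f.dim_id = d.id)
    ) foo;

    -- not measured: dominated by formatting the rows, even when the
    -- output is thrown away
    \o /dev/null
    select * from fact f join dim d on (f.dim_id = d.id);
    \o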

The following numbers come from current master (with and without the patch applied), running on an E5-2630 v3 (2.40GHz) with 64GB of RAM and this configuration:

 checkpoint_timeout = 15min
 effective_cache_size = 48GB
 maintenance_work_mem = 1GB
 max_wal_size = 4GB
 min_wal_size = 1GB
 random_page_cost = 1.5
 shared_buffers = 4GB
 work_mem = 1GB

All the other settings are left at their defaults.

I did 10 runs for each combination of sizes - the numbers are quite consistent and repeatable. The tables below show the duration of each run in milliseconds, along with the median values.


dim 100k rows, fact 1M rows
---------------------------

         master     patched
        -------     -------
   1    286.184     265.489
   2    284.827     264.961
   3    281.040     264.768
   4    280.926     267.720
   5    280.984     261.348
   6    280.878     261.463
   7    280.875     261.338
   8    281.042     261.265
   9    281.003     261.236
  10    280.939     261.185
        -------     -------
 med    280.994     261.406 (-7%)
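
(The percentage is simply the ratio of the medians - 261.406 / 280.994 is about 0.93, i.e. roughly 7% faster. The same calculation is used for the other tables.)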


dim 1M rows, fact 10M rows
--------------------------

         master     patched
       --------    --------
   1   4316.235    3690.373
   2   4399.539    3738.097
   3   4360.551    3655.602
   4   4359.763    3626.142
   5   4361.821    3621.941
   6   4359.205    3654.835
   7   4371.438    3631.212
   8   4361.857    3626.237
   9   4357.317    3676.651
  10   4359.561    3641.830
       --------    --------
 med   4360.157    3648.333 (-16%)


dim 10M rows, fact 100M rows
----------------------------

         master     patched
       --------    --------
   1  46246.467   39561.597
   2  45982.937   40083.352
   3  45818.118   39674.661
   4  45716.281   39616.585
   5  45651.117   40463.966
   6  45979.036   41395.390
   7  46045.400   41358.047
   8  45978.698   41253.946
   9  45801.343   41156.440
  10  45720.722   41374.688
      ---------   ---------
 med  45898.408   40810.203 (-11%)


So the gains seem quite solid - it's not something that would make the query an order of magnitude faster, but it's well above the noise.

Of course, in practice the queries will be more complicated, making the improvement less significant, but I don't think that's a reason not to apply it.

Two minor comments on the patch:

1) the 'subquery' variable in specialjoin_is_unique_join is unused

2) in the explain output, there should probably be a space before the
   '(inner unique)' text, so

     Hash Join (inner unique) ...

   instead of

     Hash Join(inner unique)

but that's just nitpicking at this point. Otherwise the patch seems quite solid to me.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment: unijoins-test.sql (application/sql)
Attachment: unijoins-queries.sql (application/sql)
