Some idle thoughts have prompted me to send this email. I'm wondering whether
what I'm seeing here is a simple missed optimisation, something that would be
very difficult to implement, or perhaps something I've completely
misunderstood.

 

To set the scene, let's go with the old products and sales example. In this
example each product has been given a unique ID, perhaps because
product_codes are a bit long and we want to keep the sales table as narrow as
possible.

 

If we wanted to see the sales per product we could write something like
this:

 

SELECT p.product_code,SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY p.product_code;

 

Though this plan might not be quite as optimal as it could be, since it
performs the grouping after the join:

 

                                      QUERY PLAN
---------------------------------------------------------------------------------------
 HashAggregate  (cost=33195.13..33199.63 rows=450 width=150)
   ->  Hash Join  (cost=20.13..28195.13 rows=1000000 width=150)
         Hash Cond: (s.productid = p.productid)
         ->  Seq Scan on bigsalestable s  (cost=0.00..14425.00 rows=1000000 width=8)
         ->  Hash  (cost=14.50..14.50 rows=450 width=150)
               ->  Seq Scan on products p  (cost=0.00..14.50 rows=450 width=150)
(6 rows)

 

Of course the query could have been written in the first place as:

 

SELECT p.product_code,s.quantity
FROM products AS p
INNER JOIN (SELECT productid,SUM(quantity) AS quantity
            FROM bigsalestable
            GROUP BY productid) AS s ON p.productid = s.productid;

 

And that would have given us a more efficient plan.
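
Just to illustrate the difference in shape (this is only a mock-up, not actual
EXPLAIN output, and the costs are omitted), the rewritten query aggregates the
narrow bigsalestable rows below the join, something like:

 Hash Join
   Hash Cond: (p.productid = s.productid)
   ->  Seq Scan on products p
   ->  Hash
         ->  HashAggregate
               ->  Seq Scan on bigsalestable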

 

Of course, for these two queries to actually be equivalent there would
naturally have to be a unique index on product_code in the products table.

I think I'm right in thinking that if a unique index exists to match the
GROUP BY clause, and the join condition is an equality (probably using the
same operator class as the unique btree index?), then the grouping could be
pushed down so that it takes place before the join.
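
To illustrate why the unique index is needed, here's a hypothetical
counterexample (against a copy of the schema without
product_product_code_uidx) where two products share a product_code:

INSERT INTO products VALUES (11, 'DUP'), (12, 'DUP');
INSERT INTO bigsalestable VALUES (11, 5), (12, 7);

-- The original query returns a single row ('DUP', 12): the GROUP BY on
-- product_code merges both products into one group.
-- The rewritten query returns two rows, ('DUP', 5) and ('DUP', 7): the
-- aggregation happens per productid and nothing merges them afterwards.
-- So the transformation is only safe when product_code is known to be unique.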

 

In the attached script the hand-optimised query runs about twice as fast
with a 1 million row sales table.
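
For anyone who wants to reproduce the timings without the attached script,
\timing in psql or an EXPLAIN ANALYZE of each form does the job, e.g.:

EXPLAIN ANALYZE
SELECT p.product_code,SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY p.product_code;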

 

The existence of join removal makes me think we don't want to always leave
it up to whoever, or whatever, is writing the query to make the best choice
of plan.
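
For comparison, here's the sort of thing join removal already handles with the
same tables: since productid is the primary key of products and no products
columns are referenced, the planner can drop the join entirely (a sketch; the
exact plan output will vary):

EXPLAIN
SELECT s.quantity
FROM bigsalestable s
LEFT JOIN products p ON p.productid = s.productid;

-- The expected plan is just a Seq Scan on bigsalestable, with the LEFT JOIN
-- removed, since products cannot change the result.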

 

I'm not really skilled enough in the arts of PostgreSQL's planner to look
into how hard this would be to implement, but I thought I'd throw it out
there to collect some comments on the idea.

 

Regards

 

David Rowley

 

BEGIN WORK;

CREATE TABLE products (
  productid INT NOT NULL,
  product_code VARCHAR(64) NOT NULL,
  PRIMARY KEY(productid)
);

CREATE UNIQUE INDEX product_product_code_uidx ON products (product_code);

-- create small list of products
INSERT INTO products SELECT g.id,'ABC' || CAST(g.id AS TEXT) FROM
generate_series(1,10) g(id);

CREATE TABLE bigsalestable (
  productid INT NOT NULL,
  quantity INT NOT NULL
);

INSERT INTO bigsalestable (productid,quantity)
SELECT productid,CAST(random() * 1000 AS INT)
FROM products
CROSS JOIN generate_series(1,100000);

COMMIT;

EXPLAIN
SELECT p.product_code,SUM(s.quantity)
FROM products p
INNER JOIN bigsalestable s ON p.productid = s.productid
GROUP BY p.product_code;

EXPLAIN
SELECT p.product_code,s.quantity
FROM products AS p
INNER JOIN (SELECT productid,SUM(quantity) AS quantity FROM bigsalestable
            GROUP BY productid) AS s ON p.productid = s.productid;



