Re: [GENERAL] Equivalence Classes when using IN

2017-10-11 Thread David Rowley
On 12 October 2017 at 10:15, Kim Rose Carlsen  wrote:
> Why don't I see that predicate (customer_id) pushed into the outer nested 
> loop so we don't have to sort the whole table on each loop.
>
> (See original post and follow up for definitions)
> QUERY PLAN
> -
> Nested Loop Left Join  (cost=139.00..10392.96 rows=668 width=16) (actual 
> time=0.528..35.120 rows=200 loops=1)
>   Join Filter: (c.customer_id = product.customer_id)
>   Rows Removed by Join Filter: 199900
>   ->  Nested Loop  (cost=0.28..199.21 rows=334 width=12) (actual 
> time=0.075..1.146 rows=100 loops=1)
> ->  Seq Scan on customer  (cost=0.00..21.51 rows=334 width=8) (actual 
> time=0.067..0.282 rows=100 loops=1)
>   Filter: (age < 20)
>   Rows Removed by Filter: 901
> ->  Index Only Scan using customer_pkey on customer c  
> (cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=100)
>   Index Cond: (customer_id = customer.customer_id)
>   Heap Fetches: 100
>   ->  Materialize  (cost=138.73..173.75 rows=2001 width=8) (actual 
> time=0.005..0.130 rows=2001 loops=100)
> ->  Sort  (cost=138.73..143.73 rows=2001 width=8) (actual 
> time=0.448..0.588 rows=2001 loops=1)
>   Sort Key: product.customer_id, product.product_id
>   Sort Method: quicksort  Memory: 142kB
>   ->  Seq Scan on product  (cost=0.00..29.01 rows=2001 width=8) 
> (actual time=0.006..0.215 rows=2001 loops=1)
> Planning time: 0.214 ms
> Execution time: 35.284 ms

I don't really see any blockers that would mean we couldn't support
this, it's just that we don't currently support it. The predicates
that we do pushdown are just ones we deem as safe to pushdown of the
ones that appear in the query, or ones that can be derived through
equivalence. (e.g. ab.a = ab.b and ab.b = 1 --> ab.a = 1)

For example, consider the difference between the following:

create table ab(a int, b int);
insert into ab select x,x from generate_series(1,100)x;
create index on ab(a);
create index on ab(b);

postgres=# explain select * from (select distinct on (a) a,b from ab
order by a,b) ab where ab.b < 10;
QUERY PLAN
---
 Subquery Scan on ab  (cost=127757.34..145257.34 rows=33 width=8)
   Filter: (ab.b < 10)
   ->  Unique  (cost=127757.34..132757.34 rows=100 width=8)
 ->  Sort  (cost=127757.34..130257.34 rows=100 width=8)
   Sort Key: ab_1.a, ab_1.b
   ->  Seq Scan on ab ab_1  (cost=0.00..14425.00
rows=100 width=8)
(6 rows)


postgres=# explain select * from (select distinct on (a) a,b from ab
order by a,b) ab where ab.a < 10;
  QUERY PLAN
---
 Unique  (cost=8.73..8.77 rows=9 width=8)
   ->  Sort  (cost=8.73..8.75 rows=9 width=8)
 Sort Key: ab.a, ab.b
 ->  Index Scan using ab_a_idx on ab  (cost=0.42..8.58 rows=9 width=8)
   Index Cond: (a < 10)
(5 rows)

The "a < 10" was pushed down as we're distinct on (a), but pushing
down "ab.b < 10" would be invalid and could cause wrong results.

The predicate you'd like to see pushed down is actually a parameter in
a parameterized Path and we don't currently generate any parameterized
paths outside of each query level. Likely there's no good reason for
this other than it's not been done yet, but it's really only been
since 9.6 that the query planner has been flexible enough to possibly
allow something like this to be done at all.

The reason the planner may appear to push down the predicate when
there's no DISTINCT ON clause is that the planner was able to pull the
subquery (or view) up a level. When the planner is able to do this
it's much more flexible to the types of plans it can generate. It's
just that we don't ever pull up subqueries with DISTINCT ON, plus a
bunch of other reasons.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-11 Thread Kim Rose Carlsen

> On 11 Oct 2017, at 21.46, David Rowley  wrote:
> 
>> On 12 October 2017 at 08:37, Kim Rose Carlsen  wrote:
>> 
>>> Yeah.  The ORDER BY creates a partial optimization fence, preventing
>>> any such plan from being considered.
 
>> 
>> I can see in the general case it semanticly means different things If you 
>> allow the WHERE to pass through ORDER BY.
>> 
>> A special case can be allowed for WHERE to pass the ORDER BY if the column 
>> is part of DISTINCT ON.
> 
> Yeah, we do allow predicates to be pushed down in that case.
> 

Let's ignore that it's not a very useful query I have written. 

Why don't I see that predicate (customer_id) pushed into the outer nested loop 
so we don't have to sort the whole table on each loop. 

(See original post and follow up for definitions) 
QUERY PLAN  

-
Nested Loop Left Join  (cost=139.00..10392.96 rows=668 width=16) (actual 
time=0.528..35.120 rows=200 loops=1)
  Join Filter: (c.customer_id = product.customer_id)
  Rows Removed by Join Filter: 199900
  ->  Nested Loop  (cost=0.28..199.21 rows=334 width=12) (actual 
time=0.075..1.146 rows=100 loops=1)
->  Seq Scan on customer  (cost=0.00..21.51 rows=334 width=8) (actual 
time=0.067..0.282 rows=100 loops=1)
  Filter: (age < 20)
  Rows Removed by Filter: 901
->  Index Only Scan using customer_pkey on customer c  (cost=0.28..0.53 
rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=100)
  Index Cond: (customer_id = customer.customer_id)
  Heap Fetches: 100
  ->  Materialize  (cost=138.73..173.75 rows=2001 width=8) (actual 
time=0.005..0.130 rows=2001 loops=100)
->  Sort  (cost=138.73..143.73 rows=2001 width=8) (actual 
time=0.448..0.588 rows=2001 loops=1)
  Sort Key: product.customer_id, product.product_id
  Sort Method: quicksort  Memory: 142kB
  ->  Seq Scan on product  (cost=0.00..29.01 rows=2001 width=8) 
(actual time=0.006..0.215 rows=2001 loops=1)
Planning time: 0.214 ms
Execution time: 35.284 ms

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-11 Thread David Rowley
On 12 October 2017 at 08:37, Kim Rose Carlsen  wrote:
>
>> Yeah.  The ORDER BY creates a partial optimization fence, preventing
>> any such plan from being considered.
>>>
>
> I can see in the general case it semanticly means different things If you 
> allow the WHERE to pass through ORDER BY.
>
> A special case can be allowed for WHERE to pass the ORDER BY if the column is 
> part of DISTINCT ON.

Yeah, we do allow predicates to be pushed down in that case.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-11 Thread Kim Rose Carlsen

> Yeah.  The ORDER BY creates a partial optimization fence, preventing
> any such plan from being considered.
>> 

I can see in the general case it semanticly means different things If you allow 
the WHERE to pass through ORDER BY. 

A special case can be allowed for WHERE to pass the ORDER BY if the column is 
part of DISTINCT ON.




-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-10 Thread Nico Williams
On Mon, Oct 09, 2017 at 07:44:50PM -0400, Tom Lane wrote:
> David Rowley  writes:
> > If the only reason that is_simple_subquery() rejects subqueries with
> > ORDER BY is due to wanting to keep the order by of a view, then
> > couldn't we make is_simple_subquery() a bit smarter and have it check
> > if the subquery is going to be joined to something else, which likely
> > would destroy the order, or at least it would remove any guarantees of
> > it.
> 
> I'm not on board with this.  The assumption is that if the user put an
> ORDER BY there, that means they want that subquery to be computed in that
> order.  It's not for us to decide they didn't mean what they said.
> 
> Moreover, there are cases where the ORDER BY would be semantically
> significant, eg if there's a LIMIT or volatile functions or tSRFs
> involved.

Or where the order is meaningful to an aggregate function applied to
columns of a view result set.  I'm not sure what the full set of cases
where the ORDER BY on the inner query is meaningful, but I'm sure there
are cases it is not.

If there are no such constraints on dropping the ORDER BY, then the it
could be dropped, making the view query simpler.

Nico
-- 


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-10 Thread Kim Rose Carlsen
> If the only reason that is_simple_subquery() rejects subqueries with
> ORDER BY is due to wanting to keep the order by of a view, then
> couldn't we make is_simple_subquery() a bit smarter and have it check
> if the subquery is going to be joined to something else, which likely
> would destroy the order, or at least it would remove any guarantees of
> it.
>
> Something like the attached?

I dont know if it makes any difference that the ORDER BY is used in a
DISTINCT ON clause. In this case the ORDER BY is important.


- Kim


Re: [GENERAL] Equivalence Classes when using IN

2017-10-10 Thread Kim Rose Carlsen
> You would benefit from adding the age column to view_customer, or at
> least consider having some view which contains all the columns you'll
> ever need from those tables and if you need special views with only a
> subset of columns due to some software doing "select * from
> viewname;", then you could just create some. Joining to the same table
> again seems like a bit of a waste of effort for the planner and
> executor.

I would argue that the anti pattern would be the software that
insist on using "select * from viewname;" from a view that has
calculated columns that you do not care for. I recommend
introducing both lightweight views and heavyweight views, so you
can join up probably for what you need.

My example is fabricated trying to simplify things, but I seem to
create more confusion than clarity in my example. My point was
only to see if anything could be added to the fabricated
execution path. I agree that the listed example does not make
sense. So I will try and give some more context to real use
cases.

Imagine an invoice entity where you have one relation for invoice
base data and a relation for invoice_line. The invoice has some
invoice_id, customer_id, due_date, paid_date and invoice_line
contains each line with a invoice_id, display_name, amount. A
view (view_invoice_with_amount) where you calculate the total.

so a query could be
SELECT c.customer_id,
   i.invoice_amount_total
  FROM view_customer c
  JOIN view_invoice_with_amount i
ON c.customer_id = i.customer_id
 WHERE c.first_name = 'John';

If you ever need to filter by invoice_amount_total, it might be
necesary denormalize the relations and cache the amount in the
invoice table.

> I'd assume customer_id is the PRIMARY KEY of customer and
> is unique.

This is a continuation of the previous example, maybe I should have
included it all to make it more clear. But customer_id is declared
as a primary key.

> It's not all that clear what your view is doing here. Confusingly
> there's a Sort in the plan, yet nothing in the query asked for that,
> so I guess that the view must have an ORDER BY. If you get rid of that
> the planner would likely use an index on product (customer_id) to
> parameterise the nested loop, at least, it likely would, if you have
> one.

The view is defined in the original post. What I was trying to illustrate
was a DISTINCT ON clause to prioritize multiple products pr customer
to a somewhat "main" product for the customer. The ORDER BY on product_id
would in this case then map the first product a customer gets to its
"main" product. It could also be the most valuable product or newest ordered
active product etc. It is just some way of mapping one to many relation to a
one to one. Again the example is simplified and fabricated and maybe looses
its power to explain its intents.

> It's pretty bad practice to have ORDER BY in views. I kinda wish we
> didn't even allow it, but that ship sailed many years ago...

It is required by DISTINCT ON and as soon as you go into
reporting, datawarehouse then it gets difficult to avoid these
along with group by. Instead of writing each query from the
ground up you get a huge benefit by factorizing each query into
meaningful entities that can stand alone and make sense by
themself, and from these build up the query to answer your
questions. That way you gain lots of re-use of code and
definition doesn't change between queries. The down side is it
leaves alot of work to the planner. It's a trade off between
optimization, readability and simplicity.

I hope I make more sense now.

- Kim


Re: [GENERAL] Equivalence Classes when using IN

2017-10-09 Thread David Rowley
On 10 October 2017 at 12:44, Tom Lane  wrote:
> David Rowley  writes:
>> If the only reason that is_simple_subquery() rejects subqueries with
>> ORDER BY is due to wanting to keep the order by of a view, then
>> couldn't we make is_simple_subquery() a bit smarter and have it check
>> if the subquery is going to be joined to something else, which likely
>> would destroy the order, or at least it would remove any guarantees of
>> it.
>
> I'm not on board with this.  The assumption is that if the user put an
> ORDER BY there, that means they want that subquery to be computed in that
> order.  It's not for us to decide they didn't mean what they said.
>
> Moreover, there are cases where the ORDER BY would be semantically
> significant, eg if there's a LIMIT or volatile functions or tSRFs
> involved.

Ok, thanks for looking, although, FWIW, LIMIT and tSRFs are still disabled.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-09 Thread Tom Lane
David Rowley  writes:
> If the only reason that is_simple_subquery() rejects subqueries with
> ORDER BY is due to wanting to keep the order by of a view, then
> couldn't we make is_simple_subquery() a bit smarter and have it check
> if the subquery is going to be joined to something else, which likely
> would destroy the order, or at least it would remove any guarantees of
> it.

I'm not on board with this.  The assumption is that if the user put an
ORDER BY there, that means they want that subquery to be computed in that
order.  It's not for us to decide they didn't mean what they said.

Moreover, there are cases where the ORDER BY would be semantically
significant, eg if there's a LIMIT or volatile functions or tSRFs
involved.

BTW, I noticed that I was wrong upthread about ORDER BY in subqueries
being disallowed by spec --- that was true originally, but they allow
it as of SQL:2008 or thereabouts.  It might be interesting to see if
the spec says anything concrete about the semantics of that.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-09 Thread David Rowley
On 10 October 2017 at 02:51, Tom Lane  wrote:
> David Rowley  writes:
>> It's pretty bad practice to have ORDER BY in views. I kinda wish we
>> didn't even allow it, but that ship sailed many years ago...
>
> I think it's actually disallowed by the SQL spec (although so are
> many other things we support).  IMO it's a useful facility to have
> for views that are meant for direct presentation to clients ---
> but if you'd like joins to the view to be optimized, you don't
> want an ORDER BY in there.

If the only reason that is_simple_subquery() rejects subqueries with
ORDER BY is due to wanting to keep the order by of a view, then
couldn't we make is_simple_subquery() a bit smarter and have it check
if the subquery is going to be joined to something else, which likely
would destroy the order, or at least it would remove any guarantees of
it.

Something like the attached?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


pullup_subqueries_with_order_by_when_its_not_the_only_fromitem.patch
Description: Binary data

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-09 Thread Tom Lane
David Rowley  writes:
> It's not all that clear what your view is doing here. Confusingly
> there's a Sort in the plan, yet nothing in the query asked for that,
> so I guess that the view must have an ORDER BY. If you get rid of that
> the planner would likely use an index on product (customer_id) to
> parameterise the nested loop, at least, it likely would, if you have
> one.

Yeah.  The ORDER BY creates a partial optimization fence, preventing
any such plan from being considered.

> It's pretty bad practice to have ORDER BY in views. I kinda wish we
> didn't even allow it, but that ship sailed many years ago...

I think it's actually disallowed by the SQL spec (although so are
many other things we support).  IMO it's a useful facility to have
for views that are meant for direct presentation to clients ---
but if you'd like joins to the view to be optimized, you don't
want an ORDER BY in there.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-09 Thread David Rowley
On 9 October 2017 at 22:39, Kim Rose Carlsen  wrote:
> EXPLAIN ANALYZE
>  SELECT *
>FROM customer
>JOIN view_customer
>  ON customer.customer_id = view_customer.customer_id
>   WHERE age < 20;
>
>  QUERY PLAN
> -
>  Nested Loop Left Join  (cost=139.00..10392.96 rows=668 width=16) (actual 
> time=0.528..35.120 rows=200 loops=1)
>Join Filter: (c.customer_id = product.customer_id)
>Rows Removed by Join Filter: 199900
>->  Nested Loop  (cost=0.28..199.21 rows=334 width=12) (actual 
> time=0.075..1.146 rows=100 loops=1)
>  ->  Seq Scan on customer  (cost=0.00..21.51 rows=334 width=8) 
> (actual time=0.067..0.282 rows=100 loops=1)
>Filter: (age < 20)
>Rows Removed by Filter: 901
>  ->  Index Only Scan using customer_pkey on customer c  
> (cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=100)
>Index Cond: (customer_id = customer.customer_id)
>Heap Fetches: 100
>->  Materialize  (cost=138.73..173.75 rows=2001 width=8) (actual 
> time=0.005..0.130 rows=2001 loops=100)
>  ->  Sort  (cost=138.73..143.73 rows=2001 width=8) (actual 
> time=0.448..0.588 rows=2001 loops=1)
>Sort Key: product.customer_id, product.product_id
>Sort Method: quicksort  Memory: 142kB
>->  Seq Scan on product  (cost=0.00..29.01 rows=2001 width=8) 
> (actual time=0.006..0.215 rows=2001 loops=1)
>  Planning time: 0.214 ms
>  Execution time: 35.284 ms

You would benefit from adding the age column to view_customer, or at
least consider having some view which contains all the columns you'll
ever need from those tables and if you need special views with only a
subset of columns due to some software doing "select * from
viewname;", then you could just create some. Joining to the same table
again seems like a bit of a waste of effort for the planner and
executor.  I'd assume customer_id is the PRIMARY KEY of customer and
is unique.

It's not all that clear what your view is doing here. Confusingly
there's a Sort in the plan, yet nothing in the query asked for that,
so I guess that the view must have an ORDER BY. If you get rid of that
the planner would likely use an index on product (customer_id) to
parameterise the nested loop, at least, it likely would, if you have
one.

It's pretty bad practice to have ORDER BY in views. I kinda wish we
didn't even allow it, but that ship sailed many years ago...

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-09 Thread Kim Rose Carlsen

>On 9 October 2017 at 08:01, Kim Rose Carlsen  wrote:


>> Is this because postgres never consider IN clause when building equivalence
>> class's?
>
>Only btree equality operators are considered at the moment.

After good night sleep and reading the previous discussion, I am no longer sure 
I have reduced my original problem to the right example. If we continue from 
previous setup and add the following:

   ALTER TABLE customer ADD COLUMN age INTEGER;
   UPDATE customer SET age = customer_id / 5;

CREATE INDEX ON customer (age);
CREATE INDEX ON product (customer_id);

   SET enable_hashjoin = false;
   SET enable_mergejoin = false;
    
    EXPLAIN ANALYZE
 SELECT * 
   FROM customer 
   JOIN view_customer 
 ON customer.customer_id = view_customer.customer_id 
  WHERE age < 20;

 QUERY PLAN 
 
-
 Nested Loop Left Join  (cost=139.00..10392.96 rows=668 width=16) (actual 
time=0.528..35.120 rows=200 loops=1)
   Join Filter: (c.customer_id = product.customer_id)
   Rows Removed by Join Filter: 199900
   ->  Nested Loop  (cost=0.28..199.21 rows=334 width=12) (actual 
time=0.075..1.146 rows=100 loops=1)
 ->  Seq Scan on customer  (cost=0.00..21.51 rows=334 width=8) (actual 
time=0.067..0.282 rows=100 loops=1)
   Filter: (age < 20)
   Rows Removed by Filter: 901
 ->  Index Only Scan using customer_pkey on customer c  
(cost=0.28..0.53 rows=1 width=4) (actual time=0.006..0.006 rows=1 loops=100)
   Index Cond: (customer_id = customer.customer_id)
   Heap Fetches: 100
   ->  Materialize  (cost=138.73..173.75 rows=2001 width=8) (actual 
time=0.005..0.130 rows=2001 loops=100)
 ->  Sort  (cost=138.73..143.73 rows=2001 width=8) (actual 
time=0.448..0.588 rows=2001 loops=1)
   Sort Key: product.customer_id, product.product_id
   Sort Method: quicksort  Memory: 142kB
   ->  Seq Scan on product  (cost=0.00..29.01 rows=2001 width=8) 
(actual time=0.006..0.215 rows=2001 loops=1)
 Planning time: 0.214 ms
 Execution time: 35.284 ms


The planner prefer to use hash and merge joins which is ok, when many rows are 
to be joined, I don't think any condition can be merged to make these case 
faster. I have disabled merge and hash joins to get to a nested loop join 
instead, in this case it would be much better if customer_id can be pulled 
inside the loop, so it can look at only the relevant rows and not all rows for 
each loop. I somehow inferred that this would be the same as selecting from the 
view using IN clause, now I'm not so sure anymore.

I can see there is a trade off between planner time and how exotic the case is. 
If you want to be able to hide abstraction through views I guess the nature 
becomes more OLAP oriented than OLTP. 

Best Regards
Kim Carlsen

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-08 Thread Tom Lane
David Rowley  writes:
> There are more details of the discussion in [1], although there's
> probably lots more threads to be found if you search the archives.
> [1] 
> https://www.postgresql.org/message-id/flat/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A%40mail.gmail.com#cakjs1f9fk_x_5hkcpcseimy16owe3empmmgsgwlckkj_rw9...@mail.gmail.com

That thread seems to be about transitively applying inequalities
(ie, given "a = b and a < c", deduce "b < c"), which seems like a bit
of a different animal than IN.  Some of the issues are similar
perhaps, but I'd think that being able to deduce "b IN somelist"
from "a = b and a IN somelist" is more valuable, simply because the
IN would typically be a much stronger constraint than an inequality.
So that idea suggests that it's more worth expending planner cycles
to chase the possibility.

I do vaguely recall discussions specifically around IN, though
I didn't have any luck finding one in the archives.  There's also
the recent thread
https://www.postgresql.org/message-id/flat/camjna7cc4x9yr-vajs-jsycajhrdvjqnn7m2slh1wlh-_z2...@mail.gmail.com
which suggests being able to simplify "a IN somelist AND
a IN someotherlist".  If we wanted to do that, making the
"lists" be treatable as eclass members would be a good
place to start, because that would naturally result in
intersect-able lists ending up in the same eclass.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Equivalence Classes when using IN

2017-10-08 Thread David Rowley
On 9 October 2017 at 08:01, Kim Rose Carlsen  wrote:
> Is this because postgres never consider IN clause when building equivalence
> class's?

Only btree equality operators are considered at the moment.

> Are there any interests in adding such rule?

There's been some discussion on it previously, although there is lots
to still be worked out, for example, it's not that clear if it will
always be a win to always apply the qual.

There are more details of the discussion in [1], although there's
probably lots more threads to be found if you search the archives.

[1] 
https://www.postgresql.org/message-id/flat/CAKJS1f9FK_X_5HKcPcSeimy16Owe3EmPmmGsGWLcKkj_rW9s6A%40mail.gmail.com#cakjs1f9fk_x_5hkcpcseimy16owe3empmmgsgwlckkj_rw9...@mail.gmail.com

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Equivalence Classes when using IN

2017-10-08 Thread Kim Rose Carlsen
Hi


I have this query where I think it's strange that the join doesn't pull the 
where condition in since RHS is equal to LHS. It might be easier to expain with 
an example

Setup
CREATE TABLE customer (
  customer_id INTEGER PRIMARY KEY
);

CREATE TABLE product (
  product_id  INTEGER PRIMARY KEY,
  customer_id INTEGER NOT NULL REFERENCES customer (customer_id)
);

INSERT INTO customer (SELECT generate_series FROM 
generate_series(0, 100));

INSERT INTO product (product_id, customer_id) (SELECT 
generate_series, generate_series / 2 FROM generate_series(0, 2000));


Query

   EXPLAIN ANALYSE
SELECT *
  FROM customer c
  JOIN (SELECT DISTINCT ON (customer_id) * FROM product ORDER BY 
customer_id, product_id) p
ON c.customer_id = p.customer_id
 WHERE c.customer_id IN (500, 501);

QUERY PLAN
--
 Merge Join  (cost=172.43..186.25 rows=1 width=12) (actual time=1.350..1.353 
rows=2 loops=1)
   Merge Cond: (c.customer_id = product.customer_id)
   ->  Sort  (cost=13.93..13.93 rows=2 width=4) (actual time=0.036..0.036 
rows=2 loops=1)
 Sort Key: c.customer_id
 Sort Method: quicksort  Memory: 25kB
 ->  Bitmap Heap Scan on customer c  (cost=8.58..13.92 rows=2 width=4) 
(actual time=0.026..0.027 rows=2 loops=1)
   Recheck Cond: (customer_id = ANY ('{500,501}'::integer[]))
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on customer_pkey  (cost=0.00..8.58 rows=2 
width=0) (actual time=0.018..0.018 rows=2 loops=1)
 Index Cond: (customer_id = ANY ('{500,501}'::integer[]))
   ->  Unique  (cost=158.51..169.81 rows=200 width=8) (actual time=0.783..1.221 
rows=502 loops=1)
 ->  Sort  (cost=158.51..164.16 rows=2260 width=8) (actual 
time=0.782..0.929 rows=1003 loops=1)
   Sort Key: product.customer_id, product.product_id
   Sort Method: quicksort  Memory: 142kB
   ->  Seq Scan on product  (cost=0.00..32.60 rows=2260 width=8) 
(actual time=0.015..0.366 rows=2001 loops=1)
 Planning time: 0.281 ms
 Execution time: 1.432 ms


I would expect that since c.customer_id = p.customer_id then p.customer_id IN 
(500, 501). If I apply this rule myself, I get a much nicer plan (and it could 
be even better with an index on product_id).


QUERY PLAN
--
 Merge Join  (cost=52.70..53.11 rows=1 width=12) (actual time=0.686..0.693 
rows=2 loops=1)
   Merge Cond: (product.customer_id = c.customer_id)
   ->  Unique  (cost=38.77..38.89 rows=22 width=8) (actual time=0.647..0.651 
rows=2 loops=1)
 ->  Sort  (cost=38.77..38.83 rows=23 width=8) (actual 
time=0.646..0.647 rows=4 loops=1)
   Sort Key: product.customer_id, product.product_id
   Sort Method: quicksort  Memory: 25kB
   ->  Seq Scan on product  (cost=0.00..38.25 rows=23 width=8) 
(actual time=0.331..0.632 rows=4 loops=1)
 Filter: (customer_id = ANY ('{500,501}'::integer[]))
 Rows Removed by Filter: 1997
   ->  Sort  (cost=13.93..13.93 rows=2 width=4) (actual time=0.033..0.033 
rows=2 loops=1)
 Sort Key: c.customer_id
 Sort Method: quicksort  Memory: 25kB
 ->  Bitmap Heap Scan on customer c  (cost=8.58..13.92 rows=2 width=4) 
(actual time=0.025..0.026 rows=2 loops=1)
   Recheck Cond: (customer_id = ANY ('{500,501}'::integer[]))
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on customer_pkey  (cost=0.00..8.58 rows=2 
width=0) (actual time=0.018..0.018 rows=2 loops=1)
 Index Cond: (customer_id = ANY ('{500,501}'::integer[]))
 Planning time: 0.386 ms
 Execution time: 0.774 ms
(19 rows)

Is this because postgres never consider IN clause when building equivalence 
class's?

Are there any interests in adding such rule?


My idea is to wrap this in a view

  CREATE VIEW view_customer AS
SELECT c.customer_id,
   p.product_id
  FROM customer c
 LEFT JOIN (SELECT DISTINCT ON (customer_id) * FROM product ORDER BY 
customer_id, product_id) p
ON c.customer_id = p.customer_id


Where the LEFT JOIN can be pruned if there is no explicit need for product_id. 
Here I loose the power to express that both c.customer_id and p.customer_id is 
the same.


Best regards

Kim Carlsen