Re: Self-join optimisation

2021-03-12 Thread Hywel Carver
Hi, thanks for your replies. I've tested the patch and it works for me in
the cases where I'd use it.

> And explain why your application is doing queries like this, and why it
can't be changed to changed to not generate such queries.

Reading the thread, it looks like some of the requests for this feature are
coming from people using ORMs that generate bad queries. That's not been my
experience - I've always been able to find a way to construct the right
query through the ORM or just write correct SQL. When I've wanted this
feature has always been in relation to combining views.

For example, I was recently helping out a company that runs a booking
system for leisure activities, and their database has a view for how many
staff are available on a given day to supervise a given activity (e.g.
surfing), and a view for how much equipment is available on a given day
(e.g. how many surfboards). They also have a view for the equipment
requirements for a given activity (e.g. some boat trips require a minimum
of 2 boats and 4 oars). When they want to make bookings, they have to
combine data from these views, and the tables that create them.

It would definitely be possible to write one view that had all of this data
in (and maintain the other views too, which are needed elsewhere in the
site). And it could be made wide to have all of the rows from the source
tables. But it would, to me, feel like much better code to keep the
separate decomposed views and join them together for the booking query.
Right now, that query's performance suffers in a way that this feature
would fix. So the current choices are: accept worse performance with
decomposed views, write one very large and performant view but duplicate
some of the logic, or use their ORM to generate the SQL that they'd
normally put in a view.

On Thu, Mar 11, 2021 at 10:50 PM Tomas Vondra 
wrote:

> On 3/11/21 3:39 PM, Hywel Carver wrote:
> > Great! It looks like it's been in commitfest status for a few years. Is
> > there anything someone like me (outside the pgsql-hackers community) can
> > do to help it get reviewed this time around?
> >
>
> Well, you could do a review, or at least test it with the queries your
> application is actually running. And explain why your application is
> doing queries like this, and why it can't be changed to changed to not
> generate such queries.
>
> The first couple of messages from the patch thread [1] (particularly the
> messages from May 2018) are a good explanation why patches like this are
> tricky to get through.
>
> The basic assumption is that such queries are a bit silly, and it'd be
> probably easier to modify the application not to generate them instead
> of undoing the harm in the database planner. The problem is this makes
> the planner more expensive for everyone, including people who carefully
> write "good" queries.
>
>
> And we don't want to do that, so we need to find a way to make this
> optimization very cheap (essentially "free" if not applicable), but
> that's difficult because there may be cases where the self-joins are
> intentional, and undoing them would make the query slower. And doing
> good decision requires enough information, but this decision needs to
> happen quite early in the planning, when we have very little info.
>
> So while it seems like a simple optimization, it's actually quite tricky
> to get right.
>
> (Of course, there are cases where you may get such queries even if you
> try writing good SQL, say when joining views etc.)
>
> regards
>
> [1]
>
> https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-080115490...@postgrespro.ru
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Self-join optimisation

2021-03-11 Thread Tomas Vondra
On 3/11/21 3:39 PM, Hywel Carver wrote:
> Great! It looks like it's been in commitfest status for a few years. Is
> there anything someone like me (outside the pgsql-hackers community) can
> do to help it get reviewed this time around?
> 

Well, you could do a review, or at least test it with the queries your
application is actually running. And explain why your application is
doing queries like this, and why it can't be changed to changed to not
generate such queries.

The first couple of messages from the patch thread [1] (particularly the
messages from May 2018) are a good explanation why patches like this are
tricky to get through.

The basic assumption is that such queries are a bit silly, and it'd be
probably easier to modify the application not to generate them instead
of undoing the harm in the database planner. The problem is this makes
the planner more expensive for everyone, including people who carefully
write "good" queries.


And we don't want to do that, so we need to find a way to make this
optimization very cheap (essentially "free" if not applicable), but
that's difficult because there may be cases where the self-joins are
intentional, and undoing them would make the query slower. And doing
good decision requires enough information, but this decision needs to
happen quite early in the planning, when we have very little info.

So while it seems like a simple optimization, it's actually quite tricky
to get right.

(Of course, there are cases where you may get such queries even if you
try writing good SQL, say when joining views etc.)

regards

[1]
https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-080115490...@postgrespro.ru

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Self-join optimisation

2021-03-11 Thread Hywel Carver
Great! It looks like it's been in commitfest status for a few years. Is
there anything someone like me (outside the pgsql-hackers community) can do
to help it get reviewed this time around?

On Thu, Mar 11, 2021 at 2:32 PM Matthias van de Meent <
boekewurm+postg...@gmail.com> wrote:

> On Thu, 11 Mar 2021 at 15:15, Hywel Carver  wrote:
> >
> > Hi,
> >
> > I asked this question in the Postgres Slack, and was recommended to ask
> here instead.
> >
> > A few times, I've been in a situation where I want to join a table to
> itself on its primary key. That typically happens because I have some kind
> of summary view, which I then want to join to the original table (using its
> primary key) to flesh out the summary data with other columns. That's
> executed as a join, which surprised me. But in this case, I could extend
> the view to have all of the columns of the original table to avoid the join.
> >
> > But there's another case that's harder to solve this way: combining
> views together. Here's a trivial example:
> >
> > CREATE TABLE users (id BIGINT PRIMARY KEY, varchar name);
> > CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
> > CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);
> >
> > EXPLAIN SELECT * FROM only_some_users
> > INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;
> >
> > Hash Join  (cost=29.23..43.32 rows=90 width=144)
> >   Hash Cond: (users.id = users_1.id)
> >   ->  Bitmap Heap Scan on users  (cost=6.24..19.62 rows=270 width=72)
> > Recheck Cond: (id < 10)
> > ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 rows=270
> width=0)
> >   Index Cond: (id < 10)
> >   ->  Hash  (cost=19.62..19.62 rows=270 width=72)
> > ->  Bitmap Heap Scan on users users_1  (cost=6.24..19.62
> rows=270 width=72)
> >   Recheck Cond: (id > 3)
> >   ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18
> rows=270 width=0)
> > Index Cond: (id > 3)
> >
> > Is there a reason why Postgres doesn't have an optimisation built in to
> optimise this JOIN? What I'm imagining is that a join between two aliases
> for the same table on its primary key could be optimised by treating them
> as the same table. I think the same would be true for self-joins on any
> non-null columns covered by a uniqueness constraint.
> >
> > If this is considered a desirable change, I'd be keen to work on it
> (with some guidance).
>
> There's currently a patch registered in the commitfest that could fix
> this for you, called "Remove self join on a unique column" [0].
>
>
> With regards,
>
> Matthias van de Meent
>
> [0] https://commitfest.postgresql.org/31/1712/, thread at
>
> https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-080115490...@postgrespro.ru
>


Re: Self-join optimisation

2021-03-11 Thread Justin Pryzby
On Thu, Mar 11, 2021 at 03:32:16PM +0100, Matthias van de Meent wrote:
> On Thu, 11 Mar 2021 at 15:15, Hywel Carver  wrote:
> > I asked this question in the Postgres Slack, and was recommended to ask 
> > here instead.
> >
> > A few times, I've been in a situation where I want to join a table to 
> > itself on its primary key. That typically happens because I have some kind 
> > of summary view, which I then want to join to the original table (using its 
> > primary key) to flesh out the summary data with other columns. That's 
> > executed as a join, which surprised me. But in this case, I could extend 
> > the view to have all of the columns of the original table to avoid the join.
> >
> > But there's another case that's harder to solve this way: combining views 
> > together. Here's a trivial example:
> >
> > CREATE TABLE users (id BIGINT PRIMARY KEY, varchar name);
> > CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
> > CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);
> >
> > EXPLAIN SELECT * FROM only_some_users
> > INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;
> >
> > Hash Join  (cost=29.23..43.32 rows=90 width=144)
> >   Hash Cond: (users.id = users_1.id)
> >   ->  Bitmap Heap Scan on users  (cost=6.24..19.62 rows=270 width=72)
> > Recheck Cond: (id < 10)
> > ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 rows=270 
> > width=0)
> >   Index Cond: (id < 10)
> >   ->  Hash  (cost=19.62..19.62 rows=270 width=72)
> > ->  Bitmap Heap Scan on users users_1  (cost=6.24..19.62 rows=270 
> > width=72)
> >   Recheck Cond: (id > 3)
> >   ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 
> > rows=270 width=0)
> > Index Cond: (id > 3)
> >
> > Is there a reason why Postgres doesn't have an optimisation built in to 
> > optimise this JOIN? What I'm imagining is that a join between two aliases 
> > for the same table on its primary key could be optimised by treating them 
> > as the same table. I think the same would be true for self-joins on any 
> > non-null columns covered by a uniqueness constraint.
> >
> > If this is considered a desirable change, I'd be keen to work on it (with 
> > some guidance).
> 
> There's currently a patch registered in the commitfest that could fix
> this for you, called "Remove self join on a unique column" [0].

Maybe you'd want to test the patch and send a review.
https://commitfest.postgresql.org/32/1712/

-- 
Justin




Re: Self-join optimisation

2021-03-11 Thread Matthias van de Meent
On Thu, 11 Mar 2021 at 15:15, Hywel Carver  wrote:
>
> Hi,
>
> I asked this question in the Postgres Slack, and was recommended to ask here 
> instead.
>
> A few times, I've been in a situation where I want to join a table to itself 
> on its primary key. That typically happens because I have some kind of 
> summary view, which I then want to join to the original table (using its 
> primary key) to flesh out the summary data with other columns. That's 
> executed as a join, which surprised me. But in this case, I could extend the 
> view to have all of the columns of the original table to avoid the join.
>
> But there's another case that's harder to solve this way: combining views 
> together. Here's a trivial example:
>
> CREATE TABLE users (id BIGINT PRIMARY KEY, varchar name);
> CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
> CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);
>
> EXPLAIN SELECT * FROM only_some_users
> INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;
>
> Hash Join  (cost=29.23..43.32 rows=90 width=144)
>   Hash Cond: (users.id = users_1.id)
>   ->  Bitmap Heap Scan on users  (cost=6.24..19.62 rows=270 width=72)
> Recheck Cond: (id < 10)
> ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 rows=270 
> width=0)
>   Index Cond: (id < 10)
>   ->  Hash  (cost=19.62..19.62 rows=270 width=72)
> ->  Bitmap Heap Scan on users users_1  (cost=6.24..19.62 rows=270 
> width=72)
>   Recheck Cond: (id > 3)
>   ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 rows=270 
> width=0)
> Index Cond: (id > 3)
>
> Is there a reason why Postgres doesn't have an optimisation built in to 
> optimise this JOIN? What I'm imagining is that a join between two aliases for 
> the same table on its primary key could be optimised by treating them as the 
> same table. I think the same would be true for self-joins on any non-null 
> columns covered by a uniqueness constraint.
>
> If this is considered a desirable change, I'd be keen to work on it (with 
> some guidance).

There's currently a patch registered in the commitfest that could fix
this for you, called "Remove self join on a unique column" [0].


With regards,

Matthias van de Meent

[0] https://commitfest.postgresql.org/31/1712/, thread at
https://www.postgresql.org/message-id/flat/64486b0b-0404-e39e-322d-080115490...@postgrespro.ru




Self-join optimisation

2021-03-11 Thread Hywel Carver
Hi,

I asked this question in the Postgres Slack, and was recommended to ask
here instead.

A few times, I've been in a situation where I want to join a table to
itself on its primary key. That typically happens because I have some kind
of summary view, which I then want to join to the original table (using its
primary key) to flesh out the summary data with other columns. That's
executed as a join, which surprised me. But in this case, I could extend
the view to have all of the columns of the original table to avoid the join.

But there's another case that's harder to solve this way: combining views
together. Here's a trivial example:

CREATE TABLE users (id BIGINT PRIMARY KEY, varchar name);
CREATE VIEW only_some_users AS (SELECT * FROM users WHERE id < 10);
CREATE VIEW some_other_users AS (SELECT * FROM users WHERE id > 3);

EXPLAIN SELECT * FROM only_some_users
INNER JOIN some_other_users ON only_some_users.id = some_other_users.id;

Hash Join  (cost=29.23..43.32 rows=90 width=144)

  Hash Cond: (users.id = users_1.id)

  ->  Bitmap Heap Scan on users  (cost=6.24..19.62 rows=270 width=72)

Recheck Cond: (id < 10)

->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18 rows=270
width=0)
  Index Cond: (id < 10)

  ->  Hash  (cost=19.62..19.62 rows=270 width=72)

->  Bitmap Heap Scan on users users_1  (cost=6.24..19.62 rows=270
width=72)
  Recheck Cond: (id > 3)

  ->  Bitmap Index Scan on users_pkey  (cost=0.00..6.18
rows=270 width=0)
Index Cond: (id > 3)

Is there a reason why Postgres doesn't have an optimisation built in to
optimise this JOIN? What I'm imagining is that a join between two aliases
for the same table on its primary key could be optimised by treating them
as the same table. I think the same would be true for self-joins on any
non-null columns covered by a uniqueness constraint.

If this is considered a desirable change, I'd be keen to work on it (with
some guidance).

Thanks,

Hywel