Re: Removing unneeded self joins

Andrei Lepikhov Wed, 18 Oct 2023 20:16:22 -0700

On 19/10/2023 01:50, Alexander Korotkov wrote:

On Mon, Oct 16, 2023 at 11:28 AM Andrei Lepikhov
<[email protected]> wrote:

On 12/10/2023 18:32, Alexander Korotkov wrote:

On Thu, Oct 5, 2023 at 12:17 PM Andrei Lepikhov
<[email protected]> wrote:

On 4/10/2023 14:34, Alexander Korotkov wrote:

   > Relid replacement machinery is the most contradictory code here. We used
   > a utilitarian approach and implemented a simplistic variant.


   > > 2) It would be nice to skip the insertion of IS NOT NULL checks when
   > > they are not necessary.  [1] points that infrastructure from [2] might
   > > be useful.  The patchset from [2] seems committed mow.  However, I
   > > can't see it is directly helpful in this matter.  Could we just skip
   > > adding IS NOT NULL clause for the columns, that have
   > > pg_attribute.attnotnull set?
   > Thanks for the links, I will look into that case.

To be more precise, in the attachment, you can find a diff to the main
patch, which shows the volume of changes to achieve the desired behaviour.
Some explains in regression tests shifted. So, I've made additional tests:

DROP TABLE test CASCADE;
CREATE TABLE test (a int, b int not null);
CREATE UNIQUE INDEX abc ON test(b);
explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
WHERE t1.b=t2.b;
CREATE UNIQUE INDEX abc1 ON test(a,b);
explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
WHERE t1.b=t2.b;
explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
WHERE t1.b=t2.b AND (t1.a=t2.a OR t2.a=t1.a);
DROP INDEX abc1;
explain SELECT * FROM test t1 JOIN test t2 ON (t1.a=t2.a)
WHERE t1.b=t2.b AND (t1.b=t2.b OR t2.b=t1.b);

We have almost the results we wanted to have. But in the last explain
you can see that nothing happened with the OR clause. We should use the
expression mutator instead of walker to handle such clauses. But It
doesn't process the RestrictInfo node ... I'm inclined to put a solution
of this issue off for a while.


OK.  I think it doesn't worth to eliminate IS NULL quals with this
complexity (at least at this stage of work).

I made improvements over the code.  Mostly new comments, grammar
corrections of existing comments and small refactoring.

Also, I found that the  suggestion from David Rowley [1] to qsort
array of relations to faster find duplicates is still unaddressed.
I've implemented it.  That helps to evade quadratic complexity with
large number of relations.

Also I've incorporated improvements from Alena Rybakina except one for
skipping SJ removal when no SJ quals is found.  It's not yet clear for
me if this check fix some cases. But at least optimization got skipped
in some useful cases (as you can see in regression tests).


I would like to propose one more minor improvement (see in attachment).
The idea here is that after removing a self-join and changing clauses we
should re-probe the set of relids with the same Oid, because we can find
more removable self-joins (see the demo test in join.sql).



Thank you, I've integrated this into the patch.  BTW, the patch
introduces two new GUC variables: enable_self_join_removal,
self_join_search_limit.  enable_self_join_removal variable turns
on/off optimization at all.  self_join_search_limit variable limits
its usage by the number of joins.  AFICS, self_join_search_limit is
intended to protect us from quadratic complexity self-join removal
has.  I tried to reproduce the extreme case.

SELECT count(*) FROM pgbench_accounts a0, pgbench_accounts a1, ...,
pgbench_accounts a100 WHERE a0.aid = 1 AND a1.aid = a0.aid + 1 AND ...
AND a100.aid = a99.aid + 1;

This query took 3778.432 ms with self-join removal disabled, and
3756.009 ms with self-join removal enabled.  So, no measurable
overhead.  Similar to the higher number of joins.  Can you imagine
some extreme case when self-join removal could introduce significant
overhead in comparison with other optimizer parts?  If not, should we
remove self_join_search_limit GUC?

Thanks,

It was Zhihong Yu who worried about that case [1]. And my purpose was toshow a method to avoid such a problem if it would be needed.I guess the main idea here is that we have a lot of self-joins, but onlyfew of them (or no one) can be removed.I can't imagine a practical situation when we can be stuck in theproblems here. So, I vote to remove this GUC.

[1]https://www.postgresql.org/message-id/CALNJ-vTyL-LpvSOPZxpC63Et3LJLUAFZSfRqGEhT5Rj7_EEj7w%40mail.gmail.com


--
regards,
Andrey Lepikhov
Postgres Professional

Re: Removing unneeded self joins

Reply via email to