Alena Rybakina <lena.riback...@yandex.ru> writes:
> On 23.06.2023 14:28, Andrey Lepikhov wrote:
>> On 2/5/2012 20:34, Tom Lane wrote:
>>> On reflection I think that the idea of clamping ndistinct beforehand is
>>> just wrong, and what we ought to do instead is apply a multiplier to the
>>> selectivity estimate afterwards.

>> I got stuck on some cases where (due to a tree of filters) the planner
>> underestimates the JOIN just because ndistinct feeds a huge number into
>> the selectivity estimation formula, even though the estimates for both
>> input relations are correct and properly bounded.
>> I've tried to understand the logic through commits 0d3b231eebf,
>> 97930cf578e and 7f3eba30c9d, but it is still not clear to me.
>> So why is the idea of clamping ndistinct terrible in general?  Could
>> you explain your reasons a bit more?

> Hi, I'm also interested in understanding this issue, so I dug into the
> above commits and the topic.  I hope this letter helps to sort it out.

Note that nothing's actually been done about the idea I expressed at the
start of this thread.  The behavior is better now: if you try the example
in v12 or later you get

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1059.38 rows=5000 width=244) (actual time=1.993..3.959 rows=5090 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.003..0.502 rows=10000 loops=1)
   ->  Hash  (cost=470.00..470.00 rows=5000 width=4) (actual time=1.960..1.960 rows=5000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 240kB
         ->  Seq Scan on tenk1 b  (cost=0.00..470.00 rows=5000 width=4) (actual time=0.003..1.449 rows=5000 loops=1)
               Filter: (two = 0)
               Rows Removed by Filter: 5000
 Planning Time: 0.760 ms
 Execution Time: 4.087 ms
(10 rows)

I believe this improvement just stems from commit a314c3407 ("Clamp
semijoin selectivity to be not more than inner-join selectivity"),
and so this example looks good only because the actual semijoin
output size happens to be the same as the inner-join output size.
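
Roughly, what that commit enforces is just this shape (a sketch of the
idea, not the actual selfuncs.c code):

    /*
     * Sketch of the a314c3407 clamp in eqjoinsel(): for SEMI/ANTI
     * joins, never let the semijoin selectivity estimate exceed the
     * inner-join estimate, since a semijoin cannot pass more outer
     * rows than the corresponding inner join would match.
     */
    static double
    clamp_semijoin_selectivity(double semi_selec, double inner_selec)
    {
        return (semi_selec > inner_selec) ? inner_selec : semi_selec;
    }
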
With a slight adjustment, that's no longer true and the estimate
still sucks:

regression=# explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.fivethous and b.two = 0;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=532.50..1127.50 rows=10000 width=488) (actual time=2.000..5.742 rows=10000 loops=1)
...

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=532.50..1127.50 rows=10000 width=244) (actual time=1.529..3.474 rows=5000 loops=1)
...

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1027.50 rows=1 width=244) (actual time=1.564..3.476 rows=5000 loops=1)
...
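
To spell out where that rows=1 comes from: this is my reading of the
no-MCV path in eqjoinsel_semi, so treat it as a sketch rather than the
exact code.

    /*
     * Sketch of the no-MCV path in eqjoinsel_semi(), ignoring null
     * fractions, with the ndistinct clamp from 7f3eba30c9d included.
     */
    static double
    semi_selec_sketch(double nd1, double nd2, double inner_rows)
    {
        if (nd2 > inner_rows)   /* clamp nd2 to the filtered inner size */
            nd2 = inner_rows;
        if (nd1 <= nd2)         /* every outer value presumed matched */
            return 1.0;
        return nd2 / nd1;
    }

Here nd1 = 1000 (a.thousand), nd2 = 5000 (b.fivethous), and the inner
size after the two = 0 filter is 5000 rows, so the clamp changes
nothing and the semijoin selectivity comes out 1.0; the antijoin is
then estimated at 10000 * (1.0 - 1.0) = 0 rows, clamped to 1.  The
filter halves the number of fivethous values that can actually match
a.thousand, but nothing in the stats tells the planner that, hence
the actual 5000 rows.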

So we still have an issue to fix.  I've not had time to pursue the
idea I expressed at the start of the thread.  Also, it's a bit scary
to change this logic with only a few examples to look at.  If you
could reduce the problem cases you're seeing to something sharable,
that could help move things along.

                        regards, tom lane

