Alena Rybakina <lena.riback...@yandex.ru> writes:
> On 23.06.2023 14:28, Andrey Lepikhov wrote:
>> On 2/5/2012 20:34, Tom Lane wrote:
>>> On reflection I think that the idea of clamping ndistinct beforehand is
>>> just wrong, and what we ought to do instead is apply a multiplier to the
>>> selectivity estimate afterwards.

>> I got stuck on some cases where (due to a tree of filters) the planner
>> underestimates the JOIN just because ndistinct feeds a huge number into
>> the selectivity estimation formula, even though the estimates for both
>> input relations are correct and properly bounded.
>> I've tried to understand the logic through commits 0d3b231eebf,
>> 97930cf578e and 7f3eba30c9d, but it is still not clear to me.
>> So why is the idea of clamping ndistinct terrible in general?  Could
>> you explain your reasons a bit more?

> Hi, I'm also interested in understanding this issue, so I dug into the
> above commits and the topic.  I hope this letter helps to sort it out.

Note that nothing's actually been done about the idea I expressed at the
start of this thread.  The behavior is better now: if you try the example
in v12 or later you get

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.unique2 and b.two = 0);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1059.38 rows=5000 width=244) (actual time=1.993..3.959 rows=5090 loops=1)
   Hash Cond: (a.thousand = b.unique2)
   ->  Seq Scan on tenk1 a  (cost=0.00..445.00 rows=10000 width=244) (actual time=0.003..0.502 rows=10000 loops=1)
   ->  Hash  (cost=470.00..470.00 rows=5000 width=4) (actual time=1.960..1.960 rows=5000 loops=1)
         Buckets: 8192  Batches: 1  Memory Usage: 240kB
         ->  Seq Scan on tenk1 b  (cost=0.00..470.00 rows=5000 width=4) (actual time=0.003..1.449 rows=5000 loops=1)
               Filter: (two = 0)
               Rows Removed by Filter: 5000
 Planning Time: 0.760 ms
 Execution Time: 4.087 ms
(10 rows)

I believe this improvement just stems from commit a314c3407 ("Clamp
semijoin selectivity to be not more than inner-join selectivity"),
and so this example looks good only because the actual semijoin
output size happens to be the same as the inner-join output size.
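
Roughly, what that commit enforces is just this shape (a sketch of the
idea, not the actual selfuncs.c code):

    /*
     * Sketch of the a314c3407 clamp in eqjoinsel(): for SEMI/ANTI
     * joins, never let the semijoin selectivity estimate exceed the
     * inner-join estimate, since a semijoin cannot pass more outer
     * rows than the corresponding inner join would match.
     */
    static double
    clamp_semijoin_selectivity(double semi_selec, double inner_selec)
    {
        return (semi_selec > inner_selec) ? inner_selec : semi_selec;
    }
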
With a slight adjustment, that's no longer true and the estimate
still sucks:

regression=# explain analyze select * from tenk1 a, tenk1 b where a.thousand = b.fivethous and b.two = 0;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=532.50..1127.50 rows=10000 width=488) (actual time=2.000..5.742 rows=10000 loops=1)
...

regression=# explain analyze select * from tenk1 a where exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Semi Join  (cost=532.50..1127.50 rows=10000 width=244) (actual time=1.529..3.474 rows=5000 loops=1)
...

regression=# explain analyze select * from tenk1 a where not exists(select 1 from tenk1 b where a.thousand = b.fivethous and b.two = 0);
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Hash Anti Join  (cost=532.50..1027.50 rows=1 width=244) (actual time=1.564..3.476 rows=5000 loops=1)
...
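
To spell out where that rows=1 comes from: this is my reading of the
no-MCV path in eqjoinsel_semi, so treat it as a sketch rather than the
exact code.

    /*
     * Sketch of the no-MCV path in eqjoinsel_semi(), ignoring null
     * fractions, with the ndistinct clamp from 7f3eba30c9d included.
     */
    static double
    semi_selec_sketch(double nd1, double nd2, double inner_rows)
    {
        if (nd2 > inner_rows)   /* clamp nd2 to the filtered inner size */
            nd2 = inner_rows;
        if (nd1 <= nd2)         /* every outer value presumed matched */
            return 1.0;
        return nd2 / nd1;
    }

Here nd1 = 1000 (a.thousand), nd2 = 5000 (b.fivethous), and the inner
size after the two = 0 filter is 5000 rows, so the clamp changes
nothing and the semijoin selectivity comes out 1.0; the antijoin is
then estimated at 10000 * (1.0 - 1.0) = 0 rows, clamped to 1.  The
filter halves the number of fivethous values that can actually match
a.thousand, but nothing in the stats tells the planner that, hence
the actual 5000 rows.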

So we still have an issue to fix.  I've not had time to pursue the
idea I expressed at the start of the thread.  Also, it's a bit scary
to change this logic with only a few examples to look at.  If you
could reduce the problem cases you're seeing to something sharable,
that could help move things along.

                        regards, tom lane

