On Thursday, January 12, 2012 02:24:44 AM Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
> >> Looks pretty bogus to me.  You're essentially assuming that the side of
> >> the join without statistics is unique, which is a mighty dubious
> >> assumption.
> >
> > It sure is a bit dubious. But assuming that a semijoin that has a max of
> > n rows on the inner side results in half of the outer side's rows (>> n)
> > is pretty bogus as well.
> 
> How so?  There is no reason to think that the number of LHS rows with a
> match is limited by the number of RHS rows.  If we knew that the LHS
> join key was unique, then yes that'd be sensible, but we don't know
> that.
In the current example we have an estimate for the distinctness of the
LHS. I don't see how guesstimating the RHS number of tuples in a semijoin
to vardata2->rel->rows would be worse than just assuming a selectivity of
0.5 for the whole thing. Possibly it would make sense to additionally
clamp the selectivity to an upper limit of 0.5 in that case.

I guess what I am aiming at is that overestimating the number of distinct
tuples in the *RHS* won't lead to underestimating the number of result
tuples. By clamping the selectivity to 0.5 in that case we can be sure not
to overestimate more than we do currently.
> > SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation);
> > is not exactly a fringe case, so I find it problematic regressing
> > quite a bit in the estimates.
> 
> Agreed, and that's why I don't want to put in a patch that carries the
> risk of regressing even more.  I'm happy to do something that's got some
> amount of theory behind it, but if we're just guessing, we can't afford
> to guess a very high or low selectivity.
Not sure how much more an estimate can regress than this:

Nested Loop  (cost=0.00..1859.02 rows=2550000 width=11) (actual time=0.182..11.236 rows=199 loops=1)

;)

> One thing I've considered but not done anything about is that in a lot
> of practical cases for this, the aggregation or grouping properties of
> the sub-select would provide adequate reason for assuming its output is
> more or less unique, so that taking ndistinct equal to number of rows
> actually is sane.  But it would need a bit of thought about what
> properties we want to treat as justifying such an assumption, and then
> some code to see if the join key is a Var coming out of such a
> sub-select.  (Actually, what such a patch would probably look like is
> modifying examine_simple_variable to not just punt when it finds the
> Var came from an aggregating subquery.)
Yes, having that would be great, but it would be a bit more invasive than
I want to think about right now. This thing is actually a problem for
people in the field...

Andres

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs