On Thursday, January 12, 2012 02:24:44 AM Tom Lane wrote:
> Andres Freund <and...@anarazel.de> writes:
> > On Thursday, January 12, 2012 01:01:01 AM Tom Lane wrote:
> >> Looks pretty bogus to me.  You're essentially assuming that the side of
> >> the join without statistics is unique, which is a mighty dubious
> >> assumption.
> >
> > It sure is a bit dubious. But assuming that a semijoin that has a max of
> > n rows on the inner side results in half of the outer side's rows (>> n)
> > is pretty bogus as well.
> 
> How so?  There is no reason to think that the number of LHS rows with a
> match is limited by the number of RHS rows.  If we knew that the LHS
> join key was unique, then yes that'd be sensible, but we don't know
> that.
In the current example we have an estimate for the distinctness of the
LHS. I don't see how guesstimating the RHS number of tuples in a semijoin
to vardata2->rel->rows would be worse than just assuming a selectivity of
0.5 for the whole thing. Possibly it would make sense to additionally
clamp the selectivity to an upper limit of 0.5 in that case.

I guess what I am aiming at is that overestimating the number of distinct
tuples in the *RHS* won't lead to underestimating the number of result
tuples. By clamping the selectivity to 0.5 in that case we can be sure not
to overestimate more than we do currently.
> > SELECT * FROM blub WHERE foo IN (SELECT something_with_aggregation);
> > is not exactly a fringe case, so I find it problematic regressing
> > quite a bit in the estimates.
> 
> Agreed, and that's why I don't want to put in a patch that carries the
> risk of regressing even more.  I'm happy to do something that's got some
> amount of theory behind it, but if we're just guessing, we can't afford
> to guess a very high or low selectivity.
Not sure how much more an estimate can regress than this:

Nested Loop  (cost=0.00..1859.02 rows=2550000 width=11) (actual time=0.182..11.236 rows=199 loops=1)

;)

> One thing I've considered but not done anything about is that in a lot
> of practical cases for this, the aggregation or grouping properties of
> the sub-select would provide adequate reason for assuming its output is
> more or less unique, so that taking ndistinct equal to number of rows
> actually is sane.  But it would need a bit of thought about what
> properties we want to treat as justifying such an assumption, and then
> some code to see if the join key is a Var coming out of such a
> sub-select.  (Actually, what such a patch would probably look like is
> modifying examine_simple_variable to not just punt when it finds the
> Var came from an aggregating subquery.)
Yes, having that would be great, but it would be a bit more invasive than
I want to think about right now. This thing is actually a problem for
people in the field...

Andres

-- 
Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs