Hi, On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote: > On 4 March 2016 at 13:10, Tomas Vondra <tomas.von...@2ndquadrant.com> > wrote: > > > > The risk associated with over-estimation is that we may switch from > > HashAggregate to GroupAggregate, and generally select paths better > > suited for large number of rows. Not great, because the plan may > > not be > > optimal and thus run much slower - but that usually only happens on > > small tables, because on large tables we would probably end up > > using > > those same paths anyway. > > > > With under-estimation, the main risks are much graver, as we may > > end up > > using HashAggregate only to get killed by OOM. When we're lucky not > > to > > hit that, my experience this often leads to a cascade of NestedLoop > > nodes processing orders of magnitude more tuples than expected. > > Awful. > > > > So I think the risk is asymmetric, and that's why I like the new > > estimator more, because it tends to over-estimate, but in a very > > reasonable way. > > > Agreed. Under-estimating is worse than over-estimating. > > > - reldistinct *= rel->rows / rel->tuples; > + reldistinct *= (1 - powl((reldistinct - 1) / reldistinct, > rel->rows) > > Looking at this, I agree that this new formula from [1] is generally > better than the old one in most (but not all) cases. > > One problem with it is that it no longer takes into account > rel->tuples, which means that when returning all the tuples from the > table, it no longer gives an estimate equal to the total number of > distinct values in the table. In fact it tends to underestimate when > returning nearly all the rows from the table.
Yes, that's a good point. > > The old formula is clearly not a good general-purpose estimate. > However, there is one case where it does return an accurate estimate > -- when all the values in the underlying table are distinct. In this > case the new formula consistently produces underestimates, while the > old formula works well. For example: > > CREATE TABLE t AS SELECT (100000000 * random())::int a, > i::int b > FROM generate_series(1,1000000) s(i); > ANALYSE t; > > EXPLAIN ANALYSE > SELECT a FROM t GROUP BY a; > > So there are clearly around 1 million distinct values for "a", but > the new formula gives an estimate of around 630,000. That's not a > massive underestimate, but it's sufficiently obvious and visible that > it would be good to do better if we can. > > > I think that a better formula to use would be > > reldistinct *= (1 - powl(1 - rel-rows / rel->tuples, rel->tuples / > reldistinct) > > This comes from [2], which gives a general formula for selection > without replacement, and then gives the above as an approximation to > the uniform distribution case. It's not really made clear in that > paper, but that approximation is actually the "with replacement" > approximation, but starting from different initial assumptions to > give a formula that has better boundary conditions and special-case > behaviour, as described below. > > ... > > For most values of N and n, the approximation from [2] is almost > indistinguishable from the exact formula, whereas the formula from > [1] tends to underestimate when selecting a large number of distinct > values (e.g., try setting n=900000 in the plot above). Yes, I agree that formula you propose seems to behave better. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers