Re: [HACKERS] improving GROUP BY estimation

Tomas Vondra Sun, 13 Mar 2016 15:10:27 -0700

Hi,

On Sun, 2016-03-13 at 15:24 +0000, Dean Rasheed wrote:
> On 4 March 2016 at 13:10, Tomas Vondra <[email protected]>
> wrote:
> > 
> > The risk associated with over-estimation is that we may switch from
> > HashAggregate to GroupAggregate, and generally select paths better
> > suited for large number of rows. Not great, because the plan may
> > not be
> > optimal and thus run much slower - but that usually only happens on
> > small tables, because on large tables we would probably end up
> > using
> > those same paths anyway.
> > 
> > With under-estimation, the main risks are much graver, as we may
> > end up
> > using HashAggregate only to get killed by OOM. When we're lucky not
> > to
> > hit that, my experience this often leads to a cascade of NestedLoop
> > nodes processing orders of magnitude more tuples than expected.
> > Awful.
> > 
> > So I think the risk is asymmetric, and that's why I like the new
> > estimator more, because it tends to over-estimate, but in a very
> > reasonable way.
> > 
> Agreed. Under-estimating is worse than over-estimating.
> 
> 
> -           reldistinct *= rel->rows / rel->tuples;
> +           reldistinct *= (1 - powl((reldistinct - 1) / reldistinct,
> rel->rows)
> 
> Looking at this, I agree that this new formula from [1] is generally
> better than the old one in most (but not all) cases.
> 
> One problem with it is that it no longer takes into account
> rel->tuples, which means that when returning all the tuples from the
> table, it no longer gives an estimate equal to the total number of
> distinct values in the table. In fact it tends to underestimate when
> returning nearly all the rows from the table.


Yes, that's a good point.

> 
> The old formula is clearly not a good general-purpose estimate.
> However, there is one case where it does return an accurate estimate
> -- when all the values in the underlying table are distinct. In this
> case the new formula consistently produces underestimates, while the
> old formula works well. For example:
> 
> CREATE TABLE t AS SELECT (100000000 * random())::int a,
>                          i::int b
>                     FROM generate_series(1,1000000) s(i);
> ANALYSE t;
> 
> EXPLAIN ANALYSE
> SELECT a FROM t GROUP BY a;
> 
> So there are clearly around 1 million distinct values for "a", but
> the new formula gives an estimate of around 630,000. That's not a
> massive underestimate, but it's sufficiently obvious and visible that
> it would be good to do better if we can.
> 
> 
> I think that a better formula to use would be
> 
> reldistinct *= (1 - powl(1 - rel-rows / rel->tuples, rel->tuples /
> reldistinct)
> 
> This comes from [2], which gives a general formula for selection
> without replacement, and then gives the above as an approximation to
> the uniform distribution case. It's not really made clear in that
> paper, but that approximation is actually the "with replacement"
> approximation, but starting from different initial assumptions to
> give a formula that has better boundary conditions and special-case
> behaviour, as described below.
> 
> ...
> 
> For most values of N and n, the approximation from [2] is almost
> indistinguishable from the exact formula, whereas the formula from
> [1] tends to underestimate when selecting a large number of distinct
> values (e.g., try setting n=900000 in the plot above).

Yes, I agree that formula you propose seems to behave better.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] improving GROUP BY estimation

Reply via email to