On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:

> Hi,
>
> I'm currently running some tests on a 3TB TPC-H data set, and I tripped
> over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which
> somehow illustrates the importance of the memory-bounded hashagg patch Jeff
> Davis is working on).
>
> The problem is Q18, particularly this simple subquery:
>
>     select l_orderkey
>     from lineitem
>     group by l_orderkey
>     having sum(l_quantity) > 313;
>
> which is planned like this:
>
>                                QUERY PLAN
>
> ---------------------------------------------------------------------------------
>  HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
>    Group Key: l_orderkey
>    Filter: (sum(l_quantity) > '313'::double precision)
>    ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128
> width=12)
> (4 rows)
>
> but sadly, in reality the l_orderkey cardinality looks like this:
>
>     tpch=# select count(distinct l_orderkey) from lineitem;
>        count
>     ------------
>      4500000000
>     (1 row)
>
> That's a helluva difference - not the usual one or two orders of
> magnitude, but a ~10000x underestimate.
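>
> (The stored estimate can also be read back from the stats view, e.g.:
>
>     select n_distinct
>     from pg_stats
>     where tablename = 'lineitem' and attname = 'l_orderkey';
>
> which should show the same ~430k figure with the default target.)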
>
> The usual thing to do in this case is to increase the statistics target,
> and while that does improve the estimate, the improvement is rather small:
>
>    statistics target     estimate     difference (actual/estimate)
>    --------------------------------------------------------------
>    100                     429491            ~10000x
>    1000                   4240418             ~1000x
>    10000                 42913759              ~100x
>
> I find the pattern rather strange - every time the statistics target
> increases 10x, the difference decreases 10x - maybe that's natural, but the
> perfect proportionality is suspicious IMHO.
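>
> (For reference, the targets in the table above can be raised per column
> with something along these lines, re-running ANALYZE for each value tested:
>
>     alter table lineitem alter column l_orderkey set statistics 1000;
>     analyze lineitem;
>
> and then reading the new estimate off the EXPLAIN output or pg_stats.)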
>
> Also, this is quite a large dataset - the table has ~18 billion rows, and
> even with target=10000 we're sampling only 3M rows, which is ~0.02%. That's
> a tiny sample, so some inaccuracy is naturally expected, but OTOH the TPC-H
> dataset is damn uniform - there's pretty much no skew in the distributions
> AFAIK. So I'd expect a somewhat better result.
>
> With target=10000 the plan switches to GroupAggregate, because the estimate
> becomes large enough for the expected hash table to exceed work_mem (2GB).
> But the estimate is still way off, and the plan switch is mostly just a
> lucky coincidence.
>
> So I'm wondering if there's some bug related to the dataset size (an
> integer overflow or something like that), so I added a bunch of logging
> into the estimator, printing all the parameters it computes:
>
> target=100 (samplerows=30000)
> -----------------------------
> WARNING:  attnum=1 attname=l_orderkey f1=27976 ndistinct=28977
> nmultiple=1001 toowide_cnt=0 d=28977 numer=869310000.000000
> denom=2024.046627 stadistinct=429491.094029
> WARNING:  ndistinct estimate attnum=1 attname=l_orderkey current=429491.09
> adaptive=443730.00
>
> target=1000 (samplerows=300000)
> -------------------------------
> WARNING:  attnum=1 attname=l_orderkey f1=279513 ndistinct=289644
> nmultiple=10131 toowide_cnt=0 d=289644 numer=86893200000.000000
> denom=20491.658538 stadistinct=4240418.111618
> WARNING:  ndistinct estimate attnum=1 attname=l_orderkey
> current=4240418.11 adaptive=4375171.00
>
> target=10000 (samplerows=3000000)
> ---------------------------------
> WARNING:  attnum=1 attname=l_orderkey f1=2797888 ndistinct=2897799
> nmultiple=99911 toowide_cnt=0 d=2897799 numer=8693397000000.000000
> denom=202578.313396 stadistinct=42913759.396282
> WARNING:  ndistinct estimate attnum=1 attname=l_orderkey
> current=42913759.40 adaptive=44449882.00
>
> It's totalrows=18000049031 in all cases. The logs also show the estimate
> produced by the adaptive estimator (discussed in a separate thread), but
> apparently that does not change the numbers much :-(
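>
> For reference, numer/denom above are just the pieces of the Haas-Stokes
> (Duj1) estimator, stadistinct = n*d / (n - f1 + f1*n/N), with n = samplerows,
> N = totalrows, d = distinct values in the sample and f1 = values sampled
> exactly once. Plugging in the target=100 numbers reproduces the logged
> estimate (a standalone sanity check, not the actual analyze.c code):
>
>     select (30000.0 * 28977)
>            / (30000 - 27976 + 27976 * 30000.0 / 18000049031);
>     -- ~429491.09, matching stadistinct above
>
> With d and f1 both staying at roughly the same fraction of n as the sample
> grows, the estimate scales about linearly with the sample size, which at
> least matches the suspiciously regular 10x pattern above.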
>
> Any ideas?
>
> --
> Tomas Vondra                   http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>


While a better sample / better stats are important for choosing a good plan,
in this query hash agg really is the right plan. If a sort-based group agg is
chosen, performance will be really bad. The patch that Jeff is working on is
critical for decent TPC-H numbers (unless you have an unlimited amount of
memory).
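
Just to illustrate the point, one quick way to see what the sort-based
alternative looks like is to disable hash aggregation for the session (a
rough experiment, not a recommendation for production):

    set enable_hashagg = off;
    explain
    select l_orderkey
    from lineitem
    group by l_orderkey
    having sum(l_quantity) > 313;

On ~18 billion rows that plan means a huge external sort, which is why the
memory-bounded HashAgg work matters so much for TPC-H.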

Thanks,
