On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra <tomas.von...@2ndquadrant.com > wrote:
> Hi,
>
> I'm currently running some tests on a 3TB TPC-H data set, and I tripped
> over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which
> somehow illustrates the importance of the memory-bounded hashagg patch
> Jeff Davis is working on).
>
> The problem is Q18, particularly this simple subquery:
>
>     select l_orderkey
>     from lineitem
>     group by l_orderkey
>     having sum(l_quantity) > 313;
>
> which is planned like this:
>
>                                    QUERY PLAN
> ---------------------------------------------------------------------------------
>  HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
>    Group Key: l_orderkey
>    Filter: (sum(l_quantity) > '313'::double precision)
>    ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128 width=12)
> (4 rows)
>
> but sadly, in reality the l_orderkey cardinality looks like this:
>
>     tpch=# select count(distinct l_orderkey) from lineitem;
>        count
>     ------------
>      4500000000
>     (1 row)
>
> That's a helluva difference - not the usual one or two orders of
> magnitude, but a 10000x underestimate.

Is the row order in the table correlated with the value of l_orderkey? Could
you create a copy of the table ordered at random, and see if it exhibits the
same estimation issue?
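For instance, something along these lines (lineitem_random is just an
illustrative name, and on 3TB this will of course need a lot of time and
disk space):

    create table lineitem_random as
        select * from lineitem order by random();

    analyze lineitem_random;

    explain
    select l_orderkey
    from lineitem_random
    group by l_orderkey
    having sum(l_quantity) > 313;

It might also be interesting to compare what pg_stats reports for
l_orderkey on the two tables:

    select tablename, n_distinct, correlation
    from pg_stats
    where attname = 'l_orderkey'
      and tablename in ('lineitem', 'lineitem_random');

Cheers,

Jeff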