On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra <tomas.von...@2ndquadrant.com > wrote:
> Hi,
>
> I'm currently running some tests on a 3TB TPC-H data set, and I tripped
> over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which
> somehow illustrates the importance of the memory-bounded hashagg patch
> Jeff Davis is working on).
>
> The problem is Q18, particularly this simple subquery:
>
>     select l_orderkey
>     from lineitem
>     group by l_orderkey
>     having sum(l_quantity) > 313;
>
> which is planned like this:
>
>                                    QUERY PLAN
> ---------------------------------------------------------------------------------
>  HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
>    Group Key: l_orderkey
>    Filter: (sum(l_quantity) > '313'::double precision)
>    ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128 width=12)
> (4 rows)
>
> but sadly, in reality the l_orderkey cardinality looks like this:
>
>     tpch=# select count(distinct l_orderkey) from lineitem;
>        count
>     ------------
>      4500000000
>     (1 row)
>
> That's a helluva difference - not the usual one or two orders of
> magnitude, but a 10000x underestimate.

Is the row order in the table correlated with the value of l_orderkey? Could
you create a copy of the table ordered at random, and see if it exhibits the
same estimation issue?
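For instance, something along these lines (lineitem_random is just an
illustrative name, and on 3TB this will of course need a lot of time and
disk space):

    create table lineitem_random as
        select * from lineitem order by random();

    analyze lineitem_random;

    explain
    select l_orderkey
    from lineitem_random
    group by l_orderkey
    having sum(l_quantity) > 313;

It might also be interesting to compare what pg_stats reports for
l_orderkey on the two tables:

    select tablename, n_distinct, correlation
    from pg_stats
    where attname = 'l_orderkey'
      and tablename in ('lineitem', 'lineitem_random');

Cheers,

Jeff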