On 06/19/2015 08:32 PM, Jeff Janes wrote:
On Wed, Jun 17, 2015 at 10:52 AM, Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:

    Hi,

    I'm currently running some tests on a 3TB TPC-H data set, and I
    tripped over a pretty bad n_distinct underestimate, causing OOM in
    HashAgg (which somehow illustrates the importance of the
    memory-bounded hashagg patch Jeff Davis is working on).

    The problem is Q18, particularly this simple subquery:

         select l_orderkey
         from lineitem
         group by l_orderkey
         having sum(l_quantity) > 313;

    which is planned like this:

                                    QUERY PLAN
    ---------------------------------------------------------------------------
      HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
        Group Key: l_orderkey
        Filter: (sum(l_quantity) > '313'::double precision)
        ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128 width=12)
    (4 rows)

    but sadly, in reality the l_orderkey cardinality looks like this:

         tpch=# select count(distinct l_orderkey) from lineitem;
            count
         ------------
          4500000000
         (1 row)

    That's a helluva difference - not the usual one or two orders of
    magnitude, but a 10000x underestimate.



Is the row order in the table correlated with the value of l_orderkey?

The statistics look like this:

 attnum |     attname     | n_distinct  | correlation
--------+-----------------+-------------+-------------
      1 | l_orderkey      | 4.27349e+07 |    0.988631

so yes, it's pretty much perfectly correlated.
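
(For the record, the same numbers - minus attnum, which would need a join
against pg_attribute - can be pulled from pg_stats with something like:)

    select attname, n_distinct, correlation
      from pg_stats
     where tablename = 'lineitem' and attname = 'l_orderkey';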

Could you create a copy of the table ordered at random, and see if it
exhibits the same estimation issue?

Sadly no - this is a 2.5TB table, so it's not really easy to do. But I don't really see why that should matter, because the sample is supposed to be random.
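
(If it were feasible, I'd try something like the following - just a sketch,
the target table name is hypothetical, and at this scale the sort would need
an enormous amount of disk space and time:)

    -- shuffle the rows into a new table (hypothetical name), so physical
    -- order no longer correlates with l_orderkey
    create table lineitem_shuffled as
    select * from lineitem order by random();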

But I think you might be on to something, because I manually collected a random sample with 30k rows (by explicitly generating 30k random TIDs), and I get this:

tpch=# select cnt, count(*) from (select l_orderkey, count(*) AS cnt from lineitem_sample group by 1) foo group by 1;

 cnt | count
-----+-------
   1 | 29998
   2 |     1
(2 rows)
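
(Roughly like this, for the record - a sketch only: the block count and
tuples-per-page bounds are guesses for this table, and TIDs that point past
the end of a page or at dead tuples simply match nothing, so the sample can
come out slightly smaller than 30k:)

    -- fetch rows at ~30k random TIDs (block/offset bounds are assumptions)
    create table lineitem_sample as
    select l.*
      from (select format('(%s,%s)',
                          (random() * 330000000)::bigint,
                          1 + (random() * 59)::int)::tid as tid
              from generate_series(1, 30000)) t,
           lateral (select * from lineitem where ctid = t.tid) l;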


That's quite different from what ANALYZE gets, which effectively looks something like this (derived from the logs, so not perfectly accurate - I only have f1, ndistinct and nmultiple):

 cnt | count
-----+-------
   1 | 27976
   2 |   976
   3 |    24

Am I wrong or is the sample not that random?
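
(For context: as far as I understand, analyze.c feeds these counts into the
Haas/Stokes "Duj1" estimator, n*d / (n - f1 + f1*n/N), where n is the sample
size, N the table size, d the number of distinct values in the sample and f1
the number of values seen exactly once. If that reading is right, the two
samples above give wildly different estimates - rough arithmetic:)

    -- manual TID sample: d = 29999, f1 = 29998  ->  ~4.4e8 distinct
    select (30000.0 * 29999) / ((30000 - 29998) + 29998.0 * 30000 / 18000048128);

    -- ANALYZE-like sample: d = 28976, f1 = 27976  ->  ~4.3e5 distinct
    select (30000.0 * 28976) / ((30000 - 27976) + 27976.0 * 30000 / 18000048128);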

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

