Aman Sinha created IMPALA-12168: ----------------------------------- Summary: Aggregate's cardinality overestimated for a count distinct query with grouping on same column Key: IMPALA-12168 URL: https://issues.apache.org/jira/browse/IMPALA-12168 Project: IMPALA Issue Type: Bug Components: Frontend Reporter: Aman Sinha
When the COUNT(DISTINCT col) is on the same column as the GROUP BY, we currently overestimate the cardinality: {noformat} [localhost:21050] tpch> explain select l_shipdate, l_quantity, count(distinct l_shipdate) from lineitem group by l_shipdate, l_quantity; Query: explain select l_shipdate, l_quantity, count(distinct l_shipdate) from lineitem group by l_shipdate, l_quantity +-------------------------------------------------------------+ | Explain String | +-------------------------------------------------------------+ | Max Per-Host Resource Reservation: Memory=89.75MB Threads=5 | | Per-Host Resource Estimates: Memory=517MB | | | | PLAN-ROOT SINK | | | | | 07:EXCHANGE [UNPARTITIONED] | | | | | 06:AGGREGATE [FINALIZE] | | | output: count:merge(l_shipdate) | | | group by: l_shipdate, l_quantity | | | row-size=38B cardinality=134.08K | | | | | 05:EXCHANGE [HASH(l_shipdate,l_quantity)] | | | | | 02:AGGREGATE [STREAMING] | | | output: count(l_shipdate) | | | group by: l_shipdate, l_quantity | | | row-size=38B cardinality=134.08K | | | | | 04:AGGREGATE | | | group by: l_shipdate, l_quantity, l_shipdate | | | row-size=52B cardinality=6.00M | | | | | 03:EXCHANGE [HASH(l_shipdate,l_quantity,l_shipdate)] | | | | | 01:AGGREGATE [STREAMING] | | | group by: l_shipdate, l_quantity, l_shipdate | | | row-size=52B cardinality=6.00M | | | | | 00:SCAN HDFS [tpch.lineitem] | | HDFS partitions=1/1 files=1 size=718.94MB | | row-size=30B cardinality=6.00M | +-------------------------------------------------------------+ {noformat} Here the lower Streaming Agg's cardinality is 6M rows whereas the actual rowcount for that phase is 376K: {noformat} 01:AGGREGATE 3 3 327.441ms 350.542ms 376.80K 6.00M 36.11 MB 91.57 MB STREAMING {noformat} The NDV column stats for this table: {noformat} [localhost:21050] tpch> show column stats lineitem; Query: show column stats lineitem +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+ | Column | Type | #Distinct Values | #Nulls | Max Size | Avg Size | #Trues | #Falses | +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+ | l_orderkey | BIGINT | 1563438 | 0 | 8 | 8.0 | -1 | -1 | | l_partkey | BIGINT | 200516 | 0 | 8 | 8.0 | -1 | -1 | | l_suppkey | BIGINT | 9712 | 0 | 8 | 8.0 | -1 | -1 | | l_linenumber | INT | 7 | 0 | 4 | 4.0 | -1 | -1 | | l_quantity | DECIMAL(12,2) | 51 | 0 | 8 | 8.0 | -1 | -1 | | l_extendedprice | DECIMAL(12,2) | 868550 | 0 | 8 | 8.0 | -1 | -1 | | l_discount | DECIMAL(12,2) | 11 | 0 | 8 | 8.0 | -1 | -1 | | l_tax | DECIMAL(12,2) | 9 | 0 | 8 | 8.0 | -1 | -1 | | l_returnflag | STRING | 3 | 0 | 1 | 1.0 | -1 | -1 | | l_linestatus | STRING | 2 | 0 | 1 | 1.0 | -1 | -1 | | l_shipdate | STRING | 2629 | 0 | 10 | 10.0 | -1 | -1 | | l_commitdate | STRING | 2559 | 0 | 10 | 10.0 | -1 | -1 | | l_receiptdate | STRING | 2658 | 0 | 10 | 10.0 | -1 | -1 | | l_shipinstruct | STRING | 4 | 0 | 17 | 11.9986381531 | -1 | -1 | | l_shipmode | STRING | 7 | 0 | 7 | 4.28530454636 | -1 | -1 | | l_comment | STRING | 4652621 | 0 | 43 | 26.4941692352 | -1 | -1 | +-----------------+---------------+------------------+--------+----------+---------------+--------+---------+ {noformat} The reason for this overestimation is the group by exprs contain duplicate column l_shipdate: {noformat} group by: l_shipdate, l_quantity, l_shipdate {noformat} Due to this extra duplicate column, we compute the cardinality as NDV(l_shipdate) * NDV(l_quantity) * NDV(l_shipdate) = 352M which is then capped at 6M since the child produces 6M rows. The right estimate should be NDV(l_shipdate) * NDV(l_quantity) = 134K -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org