Aman Sinha created IMPALA-12168:
-----------------------------------

             Summary: Aggregate's cardinality overestimated for a count 
distinct query with grouping on same column
                 Key: IMPALA-12168
                 URL: https://issues.apache.org/jira/browse/IMPALA-12168
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
            Reporter: Aman Sinha


When the COUNT(DISTINCT col) is on the same column as the GROUP BY, we 
currently overestimate the cardinality: 
{noformat}
[localhost:21050] tpch> explain select l_shipdate, l_quantity, count(distinct 
l_shipdate) from lineitem group by l_shipdate, l_quantity;
Query: explain select l_shipdate, l_quantity, count(distinct l_shipdate) from 
lineitem group by l_shipdate, l_quantity
+-------------------------------------------------------------+
| Explain String                                              |
+-------------------------------------------------------------+
| Max Per-Host Resource Reservation: Memory=89.75MB Threads=5 |
| Per-Host Resource Estimates: Memory=517MB                   |
|                                                             |
| PLAN-ROOT SINK                                              |
| |                                                           |
| 07:EXCHANGE [UNPARTITIONED]                                 |
| |                                                           |
| 06:AGGREGATE [FINALIZE]                                     |
| |  output: count:merge(l_shipdate)                          |
| |  group by: l_shipdate, l_quantity                         |
| |  row-size=38B cardinality=134.08K                         |
| |                                                           |
| 05:EXCHANGE [HASH(l_shipdate,l_quantity)]                   |
| |                                                           |
| 02:AGGREGATE [STREAMING]                                    |
| |  output: count(l_shipdate)                                |
| |  group by: l_shipdate, l_quantity                         |
| |  row-size=38B cardinality=134.08K                         |
| |                                                           |
| 04:AGGREGATE                                                |
| |  group by: l_shipdate, l_quantity, l_shipdate             |
| |  row-size=52B cardinality=6.00M                           |
| |                                                           |
| 03:EXCHANGE [HASH(l_shipdate,l_quantity,l_shipdate)]        |
| |                                                           |
| 01:AGGREGATE [STREAMING]                                    |
| |  group by: l_shipdate, l_quantity, l_shipdate             |
| |  row-size=52B cardinality=6.00M                           |
| |                                                           |
| 00:SCAN HDFS [tpch.lineitem]                                |
|    HDFS partitions=1/1 files=1 size=718.94MB                |
|    row-size=30B cardinality=6.00M                           |
+-------------------------------------------------------------+
{noformat}

Here the lower Streaming Agg's cardinality is 6M rows whereas the actual 
rowcount for that phase is 376K:
{noformat}
01:AGGREGATE               3      3  327.441ms  350.542ms  376.80K       6.00M  
 36.11 MB       91.57 MB  STREAMING  
{noformat}

The NDV column stats for this table:
{noformat}
[localhost:21050] tpch> show column stats lineitem;
Query: show column stats lineitem
+-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
| Column          | Type          | #Distinct Values | #Nulls | Max Size | Avg 
Size      | #Trues | #Falses |
+-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
| l_orderkey      | BIGINT        | 1563438          | 0      | 8        | 8.0  
         | -1     | -1      |
| l_partkey       | BIGINT        | 200516           | 0      | 8        | 8.0  
         | -1     | -1      |
| l_suppkey       | BIGINT        | 9712             | 0      | 8        | 8.0  
         | -1     | -1      |
| l_linenumber    | INT           | 7                | 0      | 4        | 4.0  
         | -1     | -1      |
| l_quantity      | DECIMAL(12,2) | 51               | 0      | 8        | 8.0  
         | -1     | -1      |
| l_extendedprice | DECIMAL(12,2) | 868550           | 0      | 8        | 8.0  
         | -1     | -1      |
| l_discount      | DECIMAL(12,2) | 11               | 0      | 8        | 8.0  
         | -1     | -1      |
| l_tax           | DECIMAL(12,2) | 9                | 0      | 8        | 8.0  
         | -1     | -1      |
| l_returnflag    | STRING        | 3                | 0      | 1        | 1.0  
         | -1     | -1      |
| l_linestatus    | STRING        | 2                | 0      | 1        | 1.0  
         | -1     | -1      |
| l_shipdate      | STRING        | 2629             | 0      | 10       | 10.0 
         | -1     | -1      |
| l_commitdate    | STRING        | 2559             | 0      | 10       | 10.0 
         | -1     | -1      |
| l_receiptdate   | STRING        | 2658             | 0      | 10       | 10.0 
         | -1     | -1      |
| l_shipinstruct  | STRING        | 4                | 0      | 17       | 
11.9986381531 | -1     | -1      |
| l_shipmode      | STRING        | 7                | 0      | 7        | 
4.28530454636 | -1     | -1      |
| l_comment       | STRING        | 4652621          | 0      | 43       | 
26.4941692352 | -1     | -1      |
+-----------------+---------------+------------------+--------+----------+---------------+--------+---------+
{noformat}

The reason for this overestimation is the group by exprs contain duplicate 
column l_shipdate: 
{noformat}
group by: l_shipdate, l_quantity, l_shipdate       
{noformat}

Due to this extra duplicate column, we compute the cardinality as 
NDV(l_shipdate) * NDV(l_quantity)  * NDV(l_shipdate) = 352M  which is then 
capped at 6M since the child produces 6M rows. 

The right estimate should be NDV(l_shipdate) *  NDV(l_quantity) = 134K






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to