[ 
https://issues.apache.org/jira/browse/HIVE-23485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis updated HIVE-23485:
---------------------------------------
    Status: Patch Available  (was: Open)

Add a lower bound for the cardinality of the GroupByOperator by using the 
largest known NDV among the columns present in the aggregation when there are 
no grouping sets.

More precise stats could also be derived when grouping sets are present but we 
shall do it as part of another jira.

Update LLAP plans with the estimation changes. More plans to be fixed in 
follow-up commits.


> Bound GroupByOperator stats using largest NDV among columns
> -----------------------------------------------------------
>
>                 Key: HIVE-23485
>                 URL: https://issues.apache.org/jira/browse/HIVE-23485
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>         Attachments: HIVE-23485.01.patch
>
>
> Consider the following SQL query:
> {code:sql}
> select id, name from person group by id, name;
> {code}
> and assume that the person table contains the following tuples:
> {code:sql}
> insert into person values (0, 'A') ;
> insert into person values (1, 'A') ;
> insert into person values (2, 'B') ;
> insert into person values (3, 'B') ;
> insert into person values (4, 'B') ;
> insert into person values (5, 'C') ;
> {code}
> If we know the number of distinct values (NDV) for all columns in the group 
> by clause then we can infer a lower bound for the total number of rows by 
> taking the maximun NDV of the involved columns. 
> Currently the query in the scenario above has the following plan:
> {noformat}
> Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 2 vectorized
>       File Output Operator [FS_11]
>         Group By Operator [GBY_10] (rows=3 width=92)
>           Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
>         <-Map 1 [SIMPLE_EDGE] vectorized
>           SHUFFLE [RS_9]
>             PartitionCols:_col0, _col1
>             Group By Operator [GBY_8] (rows=3 width=92)
>               Output:["_col0","_col1"],keys:id, name
>               Select Operator [SEL_7] (rows=6 width=92)
>                 Output:["id","name"]
>                 TableScan [TS_0] (rows=6 width=92)
>                   
> default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]{noformat}
> Observe that the stats for group by report 3 rows but given that the ID 
> attribute is part of the aggregation the rows cannot be less than 6.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to