[ 
https://issues.apache.org/jira/browse/HIVE-23485?focusedWorklogId=543657&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-543657
 ]

ASF GitHub Bot logged work on HIVE-23485:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 28/Jan/21 13:56
            Start Date: 28/Jan/21 13:56
    Worklog Time Spent: 10m 
      Work Description: zabetak opened a new pull request #1926:
URL: https://github.com/apache/hive/pull/1926


   ### What changes were proposed in this pull request?
   Update estimations for group by operator to take into account the largest 
NDV among the columns participating in the aggregation.
   
   ### Why are the changes needed?
   Improve accuracy of statistics.
   
   ### Does this PR introduce _any_ user-facing change?
   May result in plan changes.
   
   ### How was this patch tested?
   ```
   mvn -pl itests/qtest -Pqsplits -Pitests test -Dtest=TestMiniLlapLocalCliDriver -Dtest.output.overwrite
   mvn -pl itests/qtest -Pqsplits -Pitests test -Dtest=TestTezTPCDS30TBPerfCliDriver -Dtest.output.overwrite
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 543657)
    Remaining Estimate: 0h
            Time Spent: 10m

> Bound GroupByOperator stats using largest NDV among columns
> -----------------------------------------------------------
>
>                 Key: HIVE-23485
>                 URL: https://issues.apache.org/jira/browse/HIVE-23485
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>         Attachments: HIVE-23485.01.patch, HIVE-23485.02.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Consider the following SQL query:
> {code:sql}
> select id, name from person group by id, name;
> {code}
> and assume that the person table contains the following tuples:
> {code:sql}
> insert into person values (0, 'A') ;
> insert into person values (1, 'A') ;
> insert into person values (2, 'B') ;
> insert into person values (3, 'B') ;
> insert into person values (4, 'B') ;
> insert into person values (5, 'C') ;
> {code}
> If we know the number of distinct values (NDV) for all columns in the group 
> by clause then we can infer a lower bound for the number of rows produced by 
> the aggregation by taking the maximum NDV of the involved columns. 
> Currently the query in the scenario above has the following plan:
> {noformat}
> Vertex dependency in root stage
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
> Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Reducer 2 vectorized
>       File Output Operator [FS_11]
>         Group By Operator [GBY_10] (rows=3 width=92)
>           Output:["_col0","_col1"],keys:KEY._col0, KEY._col1
>         <-Map 1 [SIMPLE_EDGE] vectorized
>           SHUFFLE [RS_9]
>             PartitionCols:_col0, _col1
>             Group By Operator [GBY_8] (rows=3 width=92)
>               Output:["_col0","_col1"],keys:id, name
>               Select Operator [SEL_7] (rows=6 width=92)
>                 Output:["id","name"]
>                 TableScan [TS_0] (rows=6 width=92)
>                   default@person,person,Tbl:COMPLETE,Col:COMPLETE,Output:["id","name"]
> {noformat}
> Observe that the stats for both group by operators report 3 rows, but given 
> that the id column (NDV = 6) is one of the grouping keys, the output cannot 
> have fewer than 6 rows.
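
The bound described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hive's actual implementation: `bound_group_by_rows` and its parameters are hypothetical names, and the NDVs are computed exactly from the six example tuples rather than read from column statistics, which in practice may be approximate.

```python
def bound_group_by_rows(estimated_rows, key_ndvs, input_rows):
    """Clamp a GROUP BY row estimate (hypothetical helper, not a Hive API).

    The output has at least max(NDV of grouping columns) rows and at most
    as many rows as the input.
    """
    if not key_ndvs:
        raise ValueError("at least one grouping column NDV is required")
    # Lower bound: the largest NDV, capped by the input size in case the
    # NDV statistic overshoots the actual row count.
    lower = min(max(key_ndvs), input_rows)
    return min(max(estimated_rows, lower), input_rows)


# The tuples inserted into the person table in the example above.
person = [(0, "A"), (1, "A"), (2, "B"), (3, "B"), (4, "B"), (5, "C")]

ndv_id = len({row[0] for row in person})    # 6 distinct ids
ndv_name = len({row[1] for row in person})  # 3 distinct names
actual_groups = len(set(person))            # 6 distinct (id, name) pairs

# The plan above estimates 3 rows for the group by; the bound raises it to 6.
bounded = bound_group_by_rows(3, [ndv_id, ndv_name], len(person))
print(bounded)
```

With exact NDVs the bounded estimate can never exceed the true number of groups, because every distinct value of a grouping column appears in at least one output row.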



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
