[jira] [Assigned] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

Eugene Chung (Jira) Fri, 07 Aug 2020 02:39:27 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Eugene Chung reassigned HIVE-23954:
-----------------------------------

    Assignee: Eugene Chung

> count(*) with count(distinct) gives wrong results with 
> hive.optimize.countdistinct=true
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-23954
>                 URL: https://issues.apache.org/jira/browse/HIVE-23954
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Eugene Chung
>            Assignee: Eugene Chung
>            Priority: Major
>         Attachments: HIVE-23954.01.patch
>
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column 
> = '...'{code}
>  
> is not working properly when hive.optimize.countdistinct is true. By default, 
> it's true for all 3.x versions.
> In the two plans below, the aggregations part in the Output of Group By 
> Operator of Map 1 are different.
>  
> - hive.optimize.countdistinct=false
> {code:java}
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-1                                        |
> |       Reducer 2                                    |
> |       File Output Operator [FS_7]                  |
> |         Group By Operator [GBY_5] (rows=1 width=24) |
> |           
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT 
> KEY._col0:0._col0)"] |
> |         <-Map 1 [SIMPLE_EDGE]                      |
> |           SHUFFLE [RS_4]                           |
> |             Group By Operator [GBY_3] (rows=343640771 width=4160) |
> |               
> Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT 
> mid)"],keys:mid |
> |               Select Operator [SEL_2] (rows=343640771 width=4160) |
> |                 Output:["mid"]                     |
> |                 TableScan [TS_0] (rows=343640771 width=4160) |
> |                   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> |                                                    |
> +----------------------------------------------------+{code}
>  
> - hive.optimize.countdistinct=true
> {code:java}
> +----------------------------------------------------+
> |                      Explain                       |
> +----------------------------------------------------+
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in root stage                    |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-0                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-1                                        |
> |       Reducer 2                                    |
> |       File Output Operator [FS_7]                  |
> |         Group By Operator [GBY_14] (rows=1 width=16) |
> |           
> Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
> |           Group By Operator [GBY_11] (rows=343640771 width=4160) |
> |             
> Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
> |           <-Map 1 [SIMPLE_EDGE]                    |
> |             SHUFFLE [RS_10]                        |
> |               PartitionCols:_col0                  |
> |               Group By Operator [GBY_9] (rows=343640771 width=4160) |
> |                 Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
> |                 Select Operator [SEL_2] (rows=343640771 width=4160) |
> |                   Output:["mid"]                   |
> |                   TableScan [TS_0] (rows=343640771 width=4160) |
> |                     db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> |                                                    |
> +----------------------------------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (HIVE-23954) count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true

Reply via email to