[ https://issues.apache.org/jira/browse/HIVE-23954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Eugene Chung reassigned HIVE-23954:
-----------------------------------

    Assignee: Eugene Chung

> count(*) with count(distinct) gives wrong results with hive.optimize.countdistinct=true
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-23954
>                 URL: https://issues.apache.org/jira/browse/HIVE-23954
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Eugene Chung
>            Assignee: Eugene Chung
>            Priority: Major
>         Attachments: HIVE-23954.01.patch
>
>
> {code:java}
> select count(*), count(distinct mid) from db1.table1 where partitioned_column = '...'{code}
>
> does not work properly when hive.optimize.countdistinct is true. By default, it is true for all 3.x versions.
> In the two plans below, the aggregations in the Output of the Group By Operator of Map 1 differ.
>
> - hive.optimize.countdistinct=false
> {code:java}
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Plan optimized by CBO. |
> | |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | |
> | Stage-0 |
> |   Fetch Operator |
> |     limit:-1 |
> |     Stage-1 |
> |       Reducer 2 |
> |       File Output Operator [FS_7] |
> |         Group By Operator [GBY_5] (rows=1 width=24) |
> |           Output:["_col0","_col1"],aggregations:["count(VALUE._col0)","count(DISTINCT KEY._col0:0._col0)"] |
> |         <-Map 1 [SIMPLE_EDGE] |
> |           SHUFFLE [RS_4] |
> |             Group By Operator [GBY_3] (rows=343640771 width=4160) |
> |               Output:["_col0","_col1","_col2"],aggregations:["count()","count(DISTINCT mid)"],keys:mid |
> |               Select Operator [SEL_2] (rows=343640771 width=4160) |
> |                 Output:["mid"] |
> |                 TableScan [TS_0] (rows=343640771 width=4160) |
> |                   db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> | |
> +----------------------------------------------------+{code}
>
> - hive.optimize.countdistinct=true
> {code:java}
> +----------------------------------------------------+
> | Explain |
> +----------------------------------------------------+
> | Plan optimized by CBO. |
> | |
> | Vertex dependency in root stage |
> | Reducer 2 <- Map 1 (SIMPLE_EDGE) |
> | |
> | Stage-0 |
> |   Fetch Operator |
> |     limit:-1 |
> |     Stage-1 |
> |       Reducer 2 |
> |       File Output Operator [FS_7] |
> |         Group By Operator [GBY_14] (rows=1 width=16) |
> |           Output:["_col0","_col1"],aggregations:["count(_col1)","count(_col0)"] |
> |           Group By Operator [GBY_11] (rows=343640771 width=4160) |
> |             Output:["_col0","_col1"],aggregations:["count(VALUE._col0)"],keys:KEY._col0 |
> |           <-Map 1 [SIMPLE_EDGE] |
> |             SHUFFLE [RS_10] |
> |               PartitionCols:_col0 |
> |               Group By Operator [GBY_9] (rows=343640771 width=4160) |
> |                 Output:["_col0","_col1"],aggregations:["count()"],keys:mid |
> |                 Select Operator [SEL_2] (rows=343640771 width=4160) |
> |                   Output:["mid"] |
> |                   TableScan [TS_0] (rows=343640771 width=4160) |
> |                     db1@table1,table1,Tbl:COMPLETE,Col:NONE,Output:["mid"] |
> | |
> +----------------------------------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
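
As a reading aid for the two quoted plans, the following is a minimal Python sketch of one possible interpretation, not Hive code. It assumes that in the hive.optimize.countdistinct=true plan the outer Group By Operator [GBY_14] evaluates `count(_col1)` and `count(_col0)` over one row per `mid` key, so the first result counts groups instead of rows. The function names and the sample data are hypothetical.

```python
from collections import Counter


def expected(mids):
    """Reference semantics of: select count(*), count(distinct mid)."""
    total_rows = len(mids)
    distinct_mids = len({m for m in mids if m is not None})
    return total_rows, distinct_mids


def rewritten_plan_as_read(mids):
    """Mimics the aggregation shapes printed in the countdistinct=true plan.

    Assumption: GBY_9/GBY_11 produce one (_col0, _col1) row per key, where
    _col0 is mid and _col1 is the per-key row count; GBY_14 then applies
    count(_col1) and count(_col0), each counting non-null inputs.
    """
    per_key_rows = list(Counter(mids).items())
    count_col1 = sum(1 for _, cnt in per_key_rows if cnt is not None)
    count_col0 = sum(1 for key, _ in per_key_rows if key is not None)
    return count_col1, count_col0


sample = ["a", "a", "b", "c", "c", "c"]
print(expected(sample))                # (6, 3)
print(rewritten_plan_as_read(sample))  # (3, 3)
```

Under this reading, the second column (the distinct count) agrees in both cases, while the first column reports the number of distinct keys rather than the number of rows. Whether this matches the actual root cause would need to be confirmed against the attached patch.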