[ https://issues.apache.org/jira/browse/HIVE-27876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788031#comment-17788031 ]
Krisztian Kasa commented on HIVE-27876: --------------------------------------- The query in the description has a plan: {code} POSTHOOK: query: explain select age, name, count(*) from test_bucket group by age, name having count(*) > 1 POSTHOOK: type: QUERY POSTHOOK: Input: default@test_bucket #### A masked pattern was here #### STAGE DEPENDENCIES: Stage-0 is a root stage STAGE PLANS: Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: TableScan alias: test_bucket Select Operator expressions: age (type: int), name (type: string) outputColumnNames: age, name Group By Operator aggregations: count() keys: age (type: int), name (type: string) mode: final outputColumnNames: _col0, _col1, _col2 Filter Operator predicate: (_col2 > 1L) (type: boolean) ListSink {code} In this case 2 bucket files are created. Both are sorted but only at file level. The records are fetched this order by FetchOperator {code} 1 user1 dept1 2 user2 dept2 1 user1 dept1 2 user2 dept2 {code} Data is not sorted globally and group by operator treats all {{age, name}} column values as distinct values hence {{count( * )}} is 1 for all the key values then Filter operator filters out all records. Possible workaround to turn off map side group by optimization https://github.com/apache/hive/blob/feda35389dc28c8c9bf3c8a3d39de53ba90e41c0/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2019-L2022 {code} set hive.map.groupby.sorted=false; {code} > Incorrect query results on tables with ClusterBy & SortBy > --------------------------------------------------------- > > Key: HIVE-27876 > URL: https://issues.apache.org/jira/browse/HIVE-27876 > Project: Hive > Issue Type: Bug > Reporter: Naresh P R > Priority: Major > > Repro: > > {code:java} > create external table test_bucket(age int, name string, dept string) > clustered by (age, name) sorted by (age asc, name asc) into 2 buckets stored > as orc; > insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2'); > insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2'); > //empty wrong results > select age, name, count(*) from test_bucket group by age, name having > count(*) > 1; > +------+-------+------+ > | age | name | _c2 | > +------+-------+------+ > +------+-------+------+ > // Workaround > set hive.map.aggr=false; > select age, name, count(*) from test_bucket group by age, name having > count(*) > 1; > +------+--------+------+ > | age | name | _c2 | > +------+--------+------+ > | 1 | user1 | 2 | > | 2 | user2 | 2 | > +------+--------+------+ {code} > > -- This message was sent by Atlassian Jira (v8.20.10#820010)