[ 
https://issues.apache.org/jira/browse/HIVE-27876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788031#comment-17788031
 ] 

Krisztian Kasa commented on HIVE-27876:
---------------------------------------

The query in the description has a plan:
{code}
POSTHOOK: query: explain
select age, name, count(*) from test_bucket group by  age, name having count(*) 
> 1
POSTHOOK: type: QUERY
POSTHOOK: Input: default@test_bucket
#### A masked pattern was here ####
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: test_bucket
          Select Operator
            expressions: age (type: int), name (type: string)
            outputColumnNames: age, name
            Group By Operator
              aggregations: count()
              keys: age (type: int), name (type: string)
              mode: final
              outputColumnNames: _col0, _col1, _col2
              Filter Operator
                predicate: (_col2 > 1L) (type: boolean)
                ListSink
{code}

In this case 2 bucket files are created. Both are sorted but only at file 
level. The records are fetched this order by FetchOperator 
{code}
1       user1   dept1
2       user2   dept2
1       user1   dept1
2       user2   dept2
{code}
Data is not sorted globally and group by operator treats all {{age, name}} 
column values as distinct values hence {{count( * )}} is 1 for all the key 
values then Filter operator filters out all records.

Possible workaround to turn off map side group by optimization
https://github.com/apache/hive/blob/feda35389dc28c8c9bf3c8a3d39de53ba90e41c0/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2019-L2022
{code}
set hive.map.groupby.sorted=false;
{code}

> Incorrect query results on tables with ClusterBy & SortBy
> ---------------------------------------------------------
>
>                 Key: HIVE-27876
>                 URL: https://issues.apache.org/jira/browse/HIVE-27876
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Naresh P R
>            Priority: Major
>
> Repro:
>  
> {code:java}
> create external table test_bucket(age int, name string, dept string) 
> clustered by (age, name) sorted by (age asc, name asc) into 2 buckets stored 
> as orc;
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> //empty wrong results
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +------+-------+------+
> | age  | name  | _c2  |
> +------+-------+------+
> +------+-------+------+
> // Workaround
> set hive.map.aggr=false;
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +------+--------+------+
> | age  |  name  | _c2  |
> +------+--------+------+
> | 1    | user1  | 2    |
> | 2    | user2  | 2    |
> +------+--------+------+ {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to