[ 
https://issues.apache.org/jira/browse/HIVE-27876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788476#comment-17788476
 ] 

Ramesh Kumar Thangarajan commented on HIVE-27876:
-------------------------------------------------

[~kkasa] I was looking into fixing this. But I had 2 questions to think about:
1. Should we expect the data to be bucketed and sorted globally after the 
inserts? Because if all the 4 rows are inserted in the table in a single 
statement query like below, then I guess the optimization works fine.
insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2'), 
(1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
2. Having map side group by is still useful even if the data is sorted locally 
within a bucket, it is only a problem when we remove the ReduceSinkOperator. In 
that case, can we just skip removing ReduceSinkOperator as part of the 
optimization. Will we still get any real improvements as part of this 
optimization(even after skipping to remove ReduceSinkOperator)?

 

> Incorrect query results on tables with ClusterBy & SortBy
> ---------------------------------------------------------
>
>                 Key: HIVE-27876
>                 URL: https://issues.apache.org/jira/browse/HIVE-27876
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Naresh P R
>            Assignee: Ramesh Kumar Thangarajan
>            Priority: Major
>
> Repro:
>  
> {code:java}
> create external table test_bucket(age int, name string, dept string) 
> clustered by (age, name) sorted by (age asc, name asc) into 2 buckets stored 
> as orc;
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> //empty wrong results
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +------+-------+------+
> | age  | name  | _c2  |
> +------+-------+------+
> +------+-------+------+
> // Workaround
> set hive.map.aggr=false;
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +------+--------+------+
> | age  |  name  | _c2  |
> +------+--------+------+
> | 1    | user1  | 2    |
> | 2    | user2  | 2    |
> +------+--------+------+ {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to