[ 
https://issues.apache.org/jira/browse/HIVE-27876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789050#comment-17789050
 ] 

Krisztian Kasa commented on HIVE-27876:
---------------------------------------

I found another data correctness issue regarding this optimization:
{code}
create table test_bucket(age int, name string, dept string) clustered by (age) 
sorted by (age asc) into 2 buckets stored as orc;

insert into test_bucket values (10, 'user1', 'dept1'), (10, 'user2' , 'dept2'), 
( 2, 'user2' , 'dept2');
insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');

select * from test_bucket;
{code}
Order is not global:
{code}
2       user2   dept2
10      user1   dept1
10      user2   dept2
2       user2   dept2
1       user1   dept1
{code}
{code}
select age, count(*) from test_bucket group by age;
{code}
Records with key {{age = 2}} are bit aggregated:
{code}
2       1
10      2
2       1
1       1
{code}
First insert creates one file:
{code}
itests/qtest/target/localfs/warehouse/test_bucket/000000_0
{"age":2,"name":"user2","dept":"dept2"}
{"age":10,"name":"user1","dept":"dept1"}
{"age":10,"name":"user2","dept":"dept2"}
{code}

Second insert creates 2 files:
{code}
itests/qtest/target/localfs/warehouse/test_bucket/000000_0_copy_1
{"age":2,"name":"user2","dept":"dept2"}

itests/qtest/target/localfs/warehouse/test_bucket/000001_0
{"age":1,"name":"user1","dept":"dept1"}
{code}

> Incorrect query results on tables with ClusterBy & SortBy
> ---------------------------------------------------------
>
>                 Key: HIVE-27876
>                 URL: https://issues.apache.org/jira/browse/HIVE-27876
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Naresh P R
>            Assignee: Ramesh Kumar Thangarajan
>            Priority: Major
>              Labels: pull-request-available
>
> Repro:
>  
> {code:java}
> create external table test_bucket(age int, name string, dept string) 
> clustered by (age, name) sorted by (age asc, name asc) into 2 buckets stored 
> as orc;
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> insert into test_bucket values (1, 'user1', 'dept1'), ( 2, 'user2' , 'dept2');
> //empty wrong results
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +------+-------+------+
> | age  | name  | _c2  |
> +------+-------+------+
> +------+-------+------+
> // Workaround
> set hive.map.aggr=false;
> select age, name, count(*) from test_bucket group by  age, name having 
> count(*) > 1; 
> +------+--------+------+
> | age  |  name  | _c2  |
> +------+--------+------+
> | 1    | user1  | 2    |
> | 2    | user2  | 2    |
> +------+--------+------+ {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to