[ https://issues.apache.org/jira/browse/HIVE-28213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842956#comment-17842956 ]
Krisztian Kasa commented on HIVE-28213: --------------------------------------- IMHO {{hive.tez.bucket.pruning}} shouldn't be allowed while scanning external tables. It relies on the filenames: [https://github.com/apache/hive/blob/636b0d3abf00afbe2cf71dc89f762acca48867ca/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L339] [https://github.com/apache/hive/blob/636b0d3abf00afbe2cf71dc89f762acca48867ca/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2077] Datafiles belongs to external tables are allowed to changed without Hive hence this optimization can lead to data correctness issues. > Incorrect results after insert-select from similar bucketed source & target > table > --------------------------------------------------------------------------------- > > Key: HIVE-28213 > URL: https://issues.apache.org/jira/browse/HIVE-28213 > Project: Hive > Issue Type: Bug > Reporter: Naresh P R > Assignee: Zoltán Rátkai > Priority: Major > Attachments: test.q > > > Insert-select is not honoring bucketing if both source & target are bucketed > on same column. > eg., > {code:java} > CREATE EXTERNAL TABLE bucketing_table1 (id INT) > CLUSTERED BY (id) > SORTED BY (id ASC) > INTO 32 BUCKETS stored as textfile; > INSERT INTO TABLE bucketing_table1 VALUES (1), (2), (3), (4), (5); > CREATE EXTERNAL TABLE bucketing_table2 like bucketing_table1; > INSERT INTO TABLE bucketing_table2 select * from bucketing_table1;{code} > id=1 => murmur_hash(1) %32 should go to 29th bucket file. > bucketing_table1 has id=1 at 29th file, > but bucketing_table2 doesn't have 29th file because Insert-select dint honor > the bucketing. > {code:java} > SELECT count(*) FROM bucketing_table1 WHERE id = 1; > === > 1 //correct result > SELECT count(*) FROM bucketing_table2 WHERE id = 1; > === > 0 // incorrect result > select *, INPUT__FILE__NAME from bucketing_table1; > +----------------------+----------------------------------------------------+ > | bucketing_table1.id | input__file__name | > +----------------------+----------------------------------------------------+ > | 2 | <warehouse>/bucketing_table1/000004_0 | > | 3 | <warehouse>/bucketing_table1/000006_0 | > | 5 | <warehouse>/bucketing_table1/000015_0 | > | 4 | <warehouse>/bucketing_table1/000021_0 | > | 1 | <warehouse>/bucketing_table1/000029_0 | > +----------------------+----------------------------------------------------+ > select *, INPUT__FILE__NAME from bucketing_table2; > +-------------+----------------------------------------------------+ > | bucketing_table2.id | input__file__name | > +-------------+----------------------------------------------------+ > | 2 | <warehouse>/bucketing_table2/000000_0 | > | 3 | <warehouse>/bucketing_table2/000001_0 | > | 5 | <warehouse>/bucketing_table2/000002_0 | > | 4 | <warehouse>/bucketing_table2/000003_0 | > | 1 | <warehouse>/bucketing_table2/000004_0 | > +----------------------+----------------------------------------------------+{code} > Workaround for read: hive.tez.bucket.pruning=false; > PS: Attaching repro file [^test.q] -- This message was sent by Atlassian Jira (v8.20.10#820010)