[ https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
GuangMing Lu updated HIVE-22098:
--------------------------------
    Attachment: join_test.sql

> Data loss occurs when multiple tables are join with different bucket_version
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-22098
>                 URL: https://issues.apache.org/jira/browse/HIVE-22098
>             Project: Hive
>          Issue Type: Bug
>          Components: Operators
>    Affects Versions: 3.1.0, 3.1.2
>            Reporter: GuangMing Lu
>            Assignee: yongtaoliao
>            Priority: Blocker
>              Labels: data-loss, wrongresults
>         Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When tables with different bucketVersion values are joined and the number of
> reducers is greater than 2, the result is incorrect (*data loss*).
> *Scenario 1*: Three tables are joined. The intermediate result of joining
> table_a (the first table) with table_b (the second table) is referred to as
> tmp_a_b. Tables created after hive-3.0.0 have bucket_version=2 by default,
> whereas the intermediate data tmp_a_b is initialized with bucketVersion=-1,
> so when tmp_a_b is joined with the third table, its ReduceSinkOperator runs
> with bucketVersion=-1. In the init method, the hash algorithm for the join
> column is chosen according to bucketVersion: if bucketVersion = 2 and the
> operation is not an ACID operation, the new hash algorithm is used;
> otherwise the old hash algorithm is used. Because the two sides of the join
> use different hash algorithms, their rows are distributed to different
> partitions. At the reducer stage, rows with the same key cannot be paired,
> which results in data loss.
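The mechanism described in Scenario 1 can be illustrated with a small simulation. This is only a sketch, not Hive code: old_hash and new_hash are hypothetical stand-ins for the two hash algorithms (they are not Hive's real implementations), and pick_hash, shuffle and reduce_side_join are names invented here. Only the selection rule (new hash when bucketVersion is 2 and the operation is not ACID, old hash otherwise) is taken from the description above.

```python
from collections import defaultdict

NUM_REDUCERS = 3  # the report says the problem needs more than 2 reducers


def old_hash(key: int) -> int:
    """Hypothetical stand-in for the old hash; NOT Hive's real code."""
    return key


def new_hash(key: int) -> int:
    """Hypothetical stand-in for the new hash; NOT Hive's real code."""
    return key * key


def pick_hash(bucket_version: int, is_acid: bool = False):
    # Rule from the report: new hash only when bucketVersion == 2 and the
    # operation is not ACID; every other case falls back to the old hash.
    if bucket_version == 2 and not is_acid:
        return new_hash
    return old_hash


def shuffle(rows, bucket_version):
    """Route each (key, value) row to a reducer by hash(key) % NUM_REDUCERS."""
    hash_fn = pick_hash(bucket_version)
    reducers = defaultdict(list)
    for key, value in rows:
        reducers[hash_fn(key) % NUM_REDUCERS].append((key, value))
    return reducers


def reduce_side_join(left, left_version, right, right_version):
    """Inner join in which each reducer sees only the rows routed to it."""
    left_parts = shuffle(left, left_version)
    right_parts = shuffle(right, right_version)
    joined = []
    for r in range(NUM_REDUCERS):
        right_by_key = defaultdict(list)
        for key, value in right_parts[r]:
            right_by_key[key].append(value)
        for key, lval in left_parts[r]:
            for rval in right_by_key[key]:
                joined.append((key, lval, rval))
    return joined


tmp_a_b = [(k, f"ab{k}") for k in range(6)]  # intermediate data, bucketVersion = -1
table_c = [(k, f"c{k}") for k in range(6)]   # table with bucketVersion = 2

consistent = reduce_side_join(tmp_a_b, 2, table_c, 2)
mismatched = reduce_side_join(tmp_a_b, -1, table_c, 2)
print(len(consistent), len(mismatched))  # prints: 6 4
```

With these stand-in hashes, keys 2 and 5 are sent to different reducers on the two sides, so their rows never meet and are silently dropped from the join. Which keys are lost in a real Hive job depends on the actual hash implementations.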
> *Scenario 2*: Create two test tables:
> create table table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES ('bucketing_version'='1');
> create table table_bucketversion_2(col_1 string, col_2 string) TBLPROPERTIES ('bucketing_version'='2');
> When table_bucketversion_1 is joined with table_bucketversion_2, part of the
> result data is lost because the bucketVersion values differ.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)