[ https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905063#comment-16905063 ]
LuGuangMing commented on HIVE-22098:
------------------------------------

The fix changes how the hash code of the join key is computed so that both sides of a join use the same algorithm, and it uses the recommended getBucketHashCode to avoid this class of problem. To keep the hash algorithm unified, we first need to make sure that the bucket version of the joined tables is consistent.

!image-2019-08-12-18-45-15-771.png!

bucketVersion=-1 (default) join bucketVersion=1: the resolved version is 1, for compatibility with the hash algorithm of old tables.
bucketVersion=-1 (default) join bucketVersion=2: the resolved version is 2, so the new hash algorithm is used.
bucketVersion=1 join bucketVersion=2: the resolved version is 2, so the new hash algorithm is used and old tables can be joined with new tables.

HIVE-21167 . HIVE-18910

> Data loss occurs when multiple tables are join with different bucket_version
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-22098
>                 URL: https://issues.apache.org/jira/browse/HIVE-22098
>             Project: Hive
>          Issue Type: Bug
>          Components: Operators
>    Affects Versions: 3.1.0
>            Reporter: LuGuangMing
>            Assignee: LuGuangMing
>            Priority: Major
>         Attachments: image-2019-08-12-18-45-15-771.png, join_test.sql, table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When tables with different bucketVersion values are joined and the number of reducers is greater than 2, the result can easily lose data.
>
> *Scenario 1*: A three-table join. The intermediate result of joining the first table (table_a) with the second table (table_b) is recorded as tmp_a_b. When tmp_a_b is joined with the third table, that table has bucket_version=2, the default for tables created after hive-3.0.0, while the temporary data tmp_a_b is initialized with bucketVersion=-1, so its ReduceSinkOperator also runs with bucketVersion=-1. In the init method, the hash algorithm for the join column is chosen according to bucketVersion: if bucketVersion = 2 and the operation is not an ACID operation, the new hash algorithm is used; otherwise the old hash algorithm is used. Because the two sides use different hash algorithms, rows are assigned to different partitions. At the reducer stage, rows with the same key cannot be paired, which results in data loss.
>
> *Scenario 2*: Create two test tables:
> create table table_bucketversion_1(col_1 string, col_2 string) TBLPROPERTIES ('bucketing_version'='1');
> create table table_bucketversion_2(col_1 string, col_2 string) TBLPROPERTIES ('bucketing_version'='2');
> When table_bucketversion_1 is joined with table_bucketversion_2, part of the result data is lost because the bucketVersion values differ.
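
The version-resolution rule from the comment and the failure mode from Scenario 1 can be illustrated with a small, self-contained Java sketch. This is not Hive code: the class BucketVersionSketch and its methods are hypothetical names, hashV1 is a simple 31-based polynomial hash standing in for the old algorithm, and hashV2 is FNV-1a standing in for the new one (it is not Hive's actual hash). Returning -1 when both sides are unset is also an assumption, since the comment does not cover that case.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names exist in Hive.
final class BucketVersionSketch {
    static final int UNSET = -1;

    // Rule from the comment: an unset side adopts the other side's version,
    // and 1 joined with 2 resolves to 2.
    // Assumption: if both sides are unset, the result stays unset.
    static int resolveBucketVersion(int left, int right) {
        if (left == UNSET) return right;
        if (right == UNSET) return left;
        return Math.max(left, right);
    }

    // Stand-in for the old hash algorithm (31-based polynomial over UTF-8 bytes).
    static int hashV1(String key) {
        int h = 0;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h = 31 * h + b;
        }
        return h;
    }

    // Stand-in for the new hash algorithm (32-bit FNV-1a, NOT Hive's real one).
    static int hashV2(String key) {
        int h = 0x811c9dc5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x01000193;
        }
        return h;
    }

    // Only version 2 selects the new algorithm; 1 and -1 fall back to the old one,
    // mirroring the "otherwise the old algorithm" branch described in the issue.
    static int reducerFor(String key, int bucketVersion, int numReducers) {
        int h = (bucketVersion == 2) ? hashV2(key) : hashV1(key);
        return (h & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 3;
        String key = "a";

        // Scenario 1 without the fix: tmp_a_b hashes with -1, the third table with 2.
        System.out.println("tmp_a_b side -> reducer " + reducerFor(key, UNSET, reducers));
        System.out.println("table_c side -> reducer " + reducerFor(key, 2, reducers));

        // With the rule applied, both sides hash with the same resolved version.
        int resolved = resolveBucketVersion(UNSET, 2);
        System.out.println("resolved version " + resolved + " -> reducer "
                + reducerFor(key, resolved, reducers) + " on both sides");
    }
}
```

With mismatched versions the same key can be routed to different reducers, so the matching rows never meet. Resolving a single version before hashing removes that mismatch in this sketch.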