[jira] [Commented] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version

LuGuangMing (JIRA) Mon, 12 Aug 2019 03:54:23 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-22098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905063#comment-16905063
 ]


LuGuangMing commented on HIVE-22098:
------------------------------------

The fix makes the hash code computed for the join key consistent by using the 
recommended getBucketHashCode, which avoids this kind of problem. To keep a 
unified hash algorithm, we first need to ensure that the bucketVersion of the 
joined tables is consistent.

!image-2019-08-12-18-45-15-771.png!

bucketVersion=-1 (default) join bucketVersion=1: the unified bucketVersion is 
1, to stay compatible with the old table hash algorithm.

bucketVersion=-1 (default) join bucketVersion=2: the unified bucketVersion is 
2, to use the new hash algorithm.

bucketVersion=1 join bucketVersion=2: the unified bucketVersion is 2, to use 
the new hash algorithm, so that an old table can be joined with a new table. 
See HIVE-21167 and HIVE-18910.
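
As an illustration only, the three cases above can be written as one small 
rule. The class and method names below are hypothetical and are not taken from 
the Hive code base or from the patch. Treating the rule as "take the larger 
value" is an assumption that happens to reproduce the three listed cases; the 
comment does not say what should happen when both sides are -1.

```java
// Hypothetical helper, not part of Hive: picks one bucketVersion for both
// sides of a join so that both sides use the same hash algorithm.
class BucketVersionResolver {
    // -1 means "not set" (the default mentioned in the comment above).
    static final int UNSET = -1;

    // Assumption: the larger value wins, which matches the three listed cases
    // (-1 join 1 gives 1, -1 join 2 gives 2, 1 join 2 gives 2).
    static int unify(int left, int right) {
        return Math.max(left, right);
    }

    public static void main(String[] args) {
        System.out.println("-1 join 1 -> " + unify(UNSET, 1));
        System.out.println("-1 join 2 -> " + unify(UNSET, 2));
        System.out.println(" 1 join 2 -> " + unify(1, 2));
    }
}
```

With this rule the result does not depend on which table is on the left or 
right side of the join.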

> Data loss occurs when multiple tables are join with different bucket_version
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-22098
>                 URL: https://issues.apache.org/jira/browse/HIVE-22098
>             Project: Hive
>          Issue Type: Bug
>          Components: Operators
>    Affects Versions: 3.1.0
>            Reporter: LuGuangMing
>            Assignee: LuGuangMing
>            Priority: Major
>         Attachments: image-2019-08-12-18-45-15-771.png, join_test.sql, 
> table_a_data.orc, table_b_data.orc, table_c_data.orc
>
>
> When tables with different bucketVersion values are joined and the number of 
> reducers is greater than 2, the result can easily lose data.
> *Scenario 1*: Three tables are joined. The intermediate result of joining 
> the first table, table_a, with the second table, table_b, is recorded as 
> tmp_a_b. When tmp_a_b is joined with the third table, the third table has 
> bucket_version=2, the default for tables created since hive-3.0.0, while the 
> temporary data tmp_a_b is initialized with bucketVersion=-1, so the 
> ReduceSinkOperator for tmp_a_b has bucketVersion=-1. In its init method, the 
> hash algorithm for the join column is selected according to bucketVersion. 
> If bucketVersion = 2 and the operation is not an ACID operation, the new 
> hash algorithm is used. Otherwise, the old hash algorithm is used. Because 
> the hash algorithms are inconsistent, the two sides distribute their data to 
> partitions differently. In the Reducer stage, rows with the same key cannot 
> be paired, resulting in data loss.
> *Scenario 2*: Create two test tables:
> create table table_bucketversion_1(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='1');
> create table table_bucketversion_2(col_1 string, col_2 string) 
> TBLPROPERTIES ('bucketing_version'='2');
> When table_bucketversion_1 is joined with table_bucketversion_2, part of the 
> result data is lost because the bucketVersion values differ.
>  
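
The failure mode in the description can be reproduced outside Hive with a 
minimal sketch. Both hash functions below are stand-ins chosen for 
illustration (plain String.hashCode() for the "old" algorithm and 32-bit 
FNV-1a for the "new" one); they are not Hive's actual implementations, and 
the class and method names are hypothetical. The point is only that when the 
two sides of a join pick reducers with different hash functions, rows with 
the same key can land on different reducers and never meet.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BucketHashMismatchDemo {
    // Stand-in for the "old" algorithm: plain Java String.hashCode().
    static int oldHash(String key) {
        return key.hashCode();
    }

    // Stand-in for the "new" algorithm: 32-bit FNV-1a (NOT Hive's real hash).
    static int newHash(String key) {
        int h = 0x811C9DC5;                       // FNV offset basis
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;                      // FNV prime, wraps on overflow
        }
        return h;
    }

    // Maps a hash code to a reducer index in [0, numReducers).
    static int reducerFor(int hash, int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    // Keys whose left side (old hash) and right side (new hash) are routed to
    // different reducers, so the join never sees them together.
    static List<String> lostKeys(List<String> keys, int numReducers) {
        List<String> lost = new ArrayList<>();
        for (String key : keys) {
            int left = reducerFor(oldHash(key), numReducers);
            int right = reducerFor(newHash(key), numReducers);
            if (left != right) {
                lost.add(key);
            }
        }
        return lost;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "e", "f");
        System.out.println("lost with 3 reducers: " + lostKeys(keys, 3));
        System.out.println("lost with 1 reducer:  " + lostKeys(keys, 1));
    }
}
```

With a single reducer every row goes to the same place, so nothing is lost 
in this sketch; the mismatch only becomes visible once the rows are spread 
over several reducers.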



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
