[ https://issues.apache.org/jira/browse/HIVE-26472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17580529#comment-17580529 ]
John Sherman commented on HIVE-26472:
-------------------------------------

The currently proposed patch approach is to always re-allocate writeIds during recompilation when the transaction is not out of date (so the transaction gets re-used). If the transaction is out of date, this is already handled by a full rollback and open of a new transaction. The patch involves clearing the writeId cache and adding an optional reallocate boolean to allocateTableWriteIds (which defaults to false). The reallocate boolean is required because allocateTableWriteIds would otherwise return the previously acquired writeIds for the transaction, so we need to delete the old assigned writeIds and associate new ones with the txn.

> Concurrent UPDATEs can cause duplicate rows
> -------------------------------------------
>
>                 Key: HIVE-26472
>                 URL: https://issues.apache.org/jira/browse/HIVE-26472
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>    Affects Versions: 4.0.0-alpha-1
>            Reporter: John Sherman
>            Assignee: John Sherman
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: debug.diff
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Concurrent UPDATEs to the same table can cause duplicate rows when the
> following occurs:
> Two UPDATEs get assigned txnIds and writeIds like this:
> UPDATE #1 = txnId: 100 writeId: 50 <--- commits first
> UPDATE #2 = txnId: 101 writeId: 49
> To replicate the issue:
> I applied the attached debug.diff patch, which adds hive.lock.sleep.writeid
> (which controls the amount to sleep before acquiring a writeId) and
> hive.lock.sleep.post.writeid (which controls the amount to sleep after
> acquiring a writeId).
> {code:java}
> CREATE TABLE test_update(i int) STORED AS ORC
> TBLPROPERTIES('transactional'="true");
> INSERT INTO test_update VALUES (1);
> Start two beeline connections.
> In connection #1 - run:
> set hive.driver.parallel.compilation = true;
> set hive.lock.sleep.writeid=5s;
> update test_update set i = 1 where i = 1;
> Wait one second and in connection #2 - run:
> set hive.driver.parallel.compilation = true;
> set hive.lock.sleep.post.writeid=10s;
> update test_update set i = 1 where i = 1;
> After both updates complete - it is likely that test_update now contains
> two rows.
> {code}
> HIVE-24211 seems to address the case when:
> UPDATE #1 = txnId: 100 writeId: 50
> UPDATE #2 = txnId: 101 writeId: 49 <--- commits first (I think this causes
> UPDATE #1 to detect that its snapshot is out of date, because the committed
> txnId > UPDATE #1's txnId)
> A possible workaround is to set hive.driver.parallel.compilation = false,
> but this would only help in cases where there is only one HS2 instance.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
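The reallocate semantics described in the comment can be sketched in miniature. This is an illustrative toy, not Hive's actual TxnHandler code: the class name `WriteIdAllocatorSketch` and its method shape are hypothetical, but the behavior mirrors the proposal — by default `allocateTableWriteId` returns the writeId already cached for a txn, while `reallocate=true` drops that association and hands the re-used txn a fresh, strictly newer writeId so that commit order and writeId order can no longer diverge during recompilation.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposed allocateTableWriteIds change (hypothetical
// names; not the real org.apache.hadoop.hive.metastore.txn.TxnHandler).
public class WriteIdAllocatorSketch {
    private long nextWriteId = 1;
    // Cache of previously assigned writeIds, keyed by "txnId:table".
    private final Map<String, Long> allocated = new HashMap<>();

    public synchronized long allocateTableWriteId(long txnId, String table,
                                                  boolean reallocate) {
        String key = txnId + ":" + table;
        Long existing = allocated.get(key);
        if (existing != null && !reallocate) {
            // Default behavior: return the writeId already bound to this txn.
            return existing;
        }
        // First allocation, or reallocate=true during recompilation:
        // discard any old assignment and bind a fresh writeId to the txn.
        long id = nextWriteId++;
        allocated.put(key, id);
        return id;
    }

    public static void main(String[] args) {
        WriteIdAllocatorSketch a = new WriteIdAllocatorSketch();
        System.out.println(a.allocateTableWriteId(100, "test_update", false)); // 1
        System.out.println(a.allocateTableWriteId(100, "test_update", false)); // 1 (cached)
        System.out.println(a.allocateTableWriteId(100, "test_update", true));  // 2 (reallocated)
    }
}
```

The key point the sketch illustrates is why the boolean is needed at all: without it, the cached lookup on the second call always wins, and a recompiled statement in a re-used txn would keep its stale, possibly out-of-order writeId.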