[ 
https://issues.apache.org/jira/browse/HIVE-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-25258:
----------------------------------
    Labels: pull-request-available  (was: )

> Incorrect row order after query-based MINOR compaction
> ------------------------------------------------------
>
>                 Key: HIVE-25258
>                 URL: https://issues.apache.org/jira/browse/HIVE-25258
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The query based MINOR compaction uses the following sorting order in its 
> inner query: `bucket`, `originalTransaction`, `rowId`, as it can be seen in 
> the 
> [code|https://github.com/apache/hive/blob/d0bbe76ad626244802d062b0a93a9f1cd4fc5f20/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/CompactionQueryBuilder.java#L474-L476].
> But actually the rows should be ordered by originalTransactionId, 
> bucketProperty and rowId, otherwise the delete deltas cannot be applied 
> correctly. And this is the order what the MR MAJOR and MR MINOR compactions 
> write. 
> The sorting order used by the query-based MINOR compaction can lead to 
> duplicated rows when running the compaction after multiple merge statements. 
> This issue can be reproduced for example by running the following queries:
> {noformat}
> CREATE TABLE transactions(id int,value string) STORED AS ORC TBLPROPERTIES 
> ('transactional'='true');
> INSERT INTO transactions VALUES
> (1, 'value_01'),(2, 'value_02'),(3, 'value_03'),(4, 'value_04'),(5, 
> 'value_05'),(6, 'value_06'),(7, 'value_07'),(8, 'value_08');
> CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
> INSERT INTO merge_source_1 VALUES (1, 'newvalue_1'),(2, 'newvalue_2'),(4, 
> 'newvalue_4'),(6, 'newvalue_6'),(9, 'value_9'),(10, 'value_10'),(11, 
> 'value_11'),(12, 'value_12');
> MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value 
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> CREATE TABLE merge_source_2(ID int, value string) STORED AS ORC;
> INSERT INTO merge_source_2 VALUES
>   (2, 'newestvalue_2'),(4, 'newestvalue_4'),(6, 'newestvalue_6'),(10, 
> 'newestvalue_10'),(11, 'newestvalue_11'),(13, 'value_13'),(14, 'value_14');
> MERGE INTO transactions AS T 
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MINOR';
> CREATE TABLE merge_source_3(ID int, value string) STORED AS ORC;
> INSERT INTO merge_source_3 VALUES
>   (1, 'latestvalue_1'),(4, 'latestvalue_4'),(5, 'latestvalue_5'),(9, 
> 'latestvalue_9'),(11, 'latestvalue_11'),(13, 'latestvalue_13'),(15, 
> 'value_15');
> MERGE INTO transactions AS T 
> USING merge_source_3 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
> value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
> ALTER TABLE transactions COMPACT 'MINOR';
> {noformat}
> Running a select after the second compaction finished will return duplicated 
> rows:
> {noformat}
> select * from transactions order by id;
> +------------------+---------------------+
> | transactions.id  | transactions.value  |
> +------------------+---------------------+
> | 1                | newvalue_1          |
> | 1                | latestvalue_1       |
> | 2                | newestvalue_2       |
> | 2                | newvalue_2          |
> | 3                | value_03            |
> | 4                | latestvalue_4       |
> | 4                | newvalue_4          |
> | 5                | latestvalue_5       |
> | 6                | newvalue_6          |
> | 6                | newestvalue_6       |
> | 7                | value_07            |
> | 8                | value_08            |
> | 9                | latestvalue_9       |
> | 10               | newestvalue_10      |
> | 11               | latestvalue_11      |
> | 12               | value_12            |
> | 13               | latestvalue_13      |
> | 14               | value_14            |
> | 15               | value_15            |
> +------------------+---------------------+
> {noformat}
> If the same queries are run with MR MINOR compaction, instead of the 
> query-based MINOR compaction, the select will return the correct result:
> {noformat}
> +------------------+---------------------+
> | transactions.id  | transactions.value  |
> +------------------+---------------------+
> | 1                | latestvalue_1       |
> | 2                | newestvalue_2       |
> | 3                | value_03            |
> | 4                | latestvalue_4       |
> | 5                | latestvalue_5       |
> | 6                | newestvalue_6       |
> | 7                | value_07            |
> | 8                | value_08            |
> | 9                | latestvalue_9       |
> | 10               | newestvalue_10      |
> | 11               | latestvalue_11      |
> | 12               | value_12            |
> | 13               | latestvalue_13      |
> | 14               | value_14            |
> | 15               | value_15            |
> +------------------+---------------------+
> {noformat}
> The content of the bucket files in the delta and delete delta directories 
> after the query-based and MR compactions look like this.
> Query-based
> {noformat}
> Processing data file 
> tmp/transactions/delta_0000001_0000004_v0000429/bucket_00000 [length: 947]
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":1,"row":{"id":1,"value":"value_01"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":1,"row":{"id":2,"value":"value_02"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":2,"currenttransaction":1,"row":{"id":3,"value":"value_03"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":1,"row":{"id":4,"value":"value_04"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":1,"row":{"id":5,"value":"value_05"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":1,"row":{"id":6,"value":"value_06"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":6,"currenttransaction":1,"row":{"id":7,"value":"value_07"}}
> {"operation":0,"originaltransaction":1,"bucket":536870912,"rowid":7,"currenttransaction":1,"row":{"id":8,"value":"value_08"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":{"id":9,"value":"value_9"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":{"id":10,"value":"value_10"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":2,"row":{"id":11,"value":"value_11"}}
> {"operation":0,"originaltransaction":2,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":{"id":12,"value":"value_12"}}
> {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":3,"row":{"id":13,"value":"value_13"}}
> {"operation":0,"originaltransaction":3,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":{"id":14,"value":"value_14"}}
> {"operation":0,"originaltransaction":4,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":{"id":15,"value":"value_15"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":2,"row":{"id":1,"value":"newvalue_1"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":2,"row":{"id":2,"value":"newvalue_2"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":2,"row":{"id":4,"value":"newvalue_4"}}
> {"operation":0,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":2,"row":{"id":6,"value":"newvalue_6"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":0,"currenttransaction":3,"row":{"id":10,"value":"newestvalue_10"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":{"id":11,"value":"newestvalue_11"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":{"id":2,"value":"newestvalue_2"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":{"id":4,"value":"newestvalue_4"}}
> {"operation":0,"originaltransaction":3,"bucket":536870913,"rowid":4,"currenttransaction":3,"row":{"id":6,"value":"newestvalue_6"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":{"id":5,"value":"latestvalue_5"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":{"id":9,"value":"latestvalue_9"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":2,"currenttransaction":4,"row":{"id":1,"value":"latestvalue_1"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":{"id":13,"value":"latestvalue_13"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":4,"currenttransaction":4,"row":{"id":11,"value":"latestvalue_11"}}
> {"operation":0,"originaltransaction":4,"bucket":536870913,"rowid":5,"currenttransaction":4,"row":{"id":4,"value":"latestvalue_4"}}
> Processing data file 
> tmp/transactions/delete_delta_0000001_0000004_v0000429/bucket_00000 [length: 
> 713]
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":0,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":1,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":3,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":4,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":1,"bucket":536870912,"rowid":5,"currenttransaction":2,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":1,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870912,"rowid":2,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":3,"bucket":536870912,"rowid":0,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":0,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":1,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":2,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":2,"bucket":536870913,"rowid":3,"currenttransaction":3,"row":null}
> {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":1,"currenttransaction":4,"row":null}
> {"operation":2,"originaltransaction":3,"bucket":536870913,"rowid":3,"currenttransaction":4,"row":null}
> _____________________________________________________________________________________________________________________
> {noformat}
> MR
> {noformat}
> Processing data file 
> tmp/transactions/delta_0000001_0000004_v0000479/bucket_00000 [length: 1002]
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":1,"row":{"id":1,"value":"value_01"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":1,"row":{"id":2,"value":"value_02"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":2,"currentTransaction":1,"row":{"id":3,"value":"value_03"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":1,"row":{"id":4,"value":"value_04"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":1,"row":{"id":5,"value":"value_05"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":1,"row":{"id":6,"value":"value_06"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":6,"currentTransaction":1,"row":{"id":7,"value":"value_07"}}
> {"operation":0,"originalTransaction":1,"bucket":536870912,"rowId":7,"currentTransaction":1,"row":{"id":8,"value":"value_08"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":{"id":9,"value":"value_9"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":{"id":10,"value":"value_10"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":2,"row":{"id":11,"value":"value_11"}}
> {"operation":0,"originalTransaction":2,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":{"id":12,"value":"value_12"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":2,"row":{"id":1,"value":"newvalue_1"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":2,"row":{"id":2,"value":"newvalue_2"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":2,"row":{"id":4,"value":"newvalue_4"}}
> {"operation":0,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":2,"row":{"id":6,"value":"newvalue_6"}}
> {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":{"id":13,"value":"value_13"}}
> {"operation":0,"originalTransaction":3,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":{"id":14,"value":"value_14"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":0,"currentTransaction":3,"row":{"id":10,"value":"newestvalue_10"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":{"id":11,"value":"newestvalue_11"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":{"id":2,"value":"newestvalue_2"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":{"id":4,"value":"newestvalue_4"}}
> {"operation":0,"originalTransaction":3,"bucket":536870913,"rowId":4,"currentTransaction":3,"row":{"id":6,"value":"newestvalue_6"}}
> {"operation":0,"originalTransaction":4,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":{"id":15,"value":"value_15"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":{"id":5,"value":"latestvalue_5"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":{"id":9,"value":"latestvalue_9"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":2,"currentTransaction":4,"row":{"id":1,"value":"latestvalue_1"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":{"id":13,"value":"latestvalue_13"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":4,"currentTransaction":4,"row":{"id":11,"value":"latestvalue_11"}}
> {"operation":0,"originalTransaction":4,"bucket":536870913,"rowId":5,"currentTransaction":4,"row":{"id":4,"value":"latestvalue_4"}}
> _____________________________________________________________________________________________________________________
> Processing data file 
> tmp/transactions/delete_delta_0000001_0000004_v0000479/bucket_00000 [length: 
> 632]
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":0,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":1,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":3,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":4,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":1,"bucket":536870912,"rowId":5,"currentTransaction":2,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":2,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":0,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":1,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":2,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":2,"bucket":536870913,"rowId":3,"currentTransaction":3,"row":null}
> {"operation":2,"originalTransaction":3,"bucket":536870912,"rowId":0,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":1,"currentTransaction":4,"row":null}
> {"operation":2,"originalTransaction":3,"bucket":536870913,"rowId":3,"currentTransaction":4,"row":null}
> _____________________________________________________________________________________________________________________
> {noformat}
> It can be seen that when the query-based MINOR compaction was used, the 
> sorting order of the rows is "bucket, originalTransactionId, rowId".  But 
> when the MR MINOR compaction was used, the order is "originalTransactionId, 
> bucket, rowId".
> The ordering in the query-based compaction has to be fixed to be aligned with 
> the MR compaction's ordering.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to