Marta Kuczora updated HIVE-25257:
---------------------------------
    Component/s: Transactions

> Incorrect row order validation for query-based MAJOR compaction
> ---------------------------------------------------------------
>
>                 Key: HIVE-25257
>                 URL: https://issues.apache.org/jira/browse/HIVE-25257
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>             Fix For: 4.0.0
>
>
> The insert query of the query-based MAJOR compaction contains the function call
> "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, ROW__ID.rowId)",
> which validates that the rows are in the correct order. The validation is done
> by the GenericUDFValidateAcidSortOrder class, which assumes that the rows are
> in increasing order by bucketProperty, originalTransactionId and rowId.
> However, the rows should actually be ordered by originalTransactionId,
> bucketProperty and rowId, otherwise the delete deltas cannot be applied
> correctly. This is the order in which the MR MAJOR compaction writes the rows
> and in which the split groups are created for the query-based MAJOR compaction.
> The wrong assumption doesn't cause any issue as long as there is only one
> bucketProperty in the files, but as soon as there are multiple bucketProperties
> in the same file, the validation will fail. This can be reproduced by running
> multiple merge statements after each other.
> For example:
> {noformat}
> CREATE TABLE transactions (id int, value string) STORED AS ORC TBLPROPERTIES ('transactional'='true');
>
> INSERT INTO transactions VALUES
> (1, 'value_1'),
> (2, 'value_2'),
> (3, 'value_3'),
> (4, 'value_4'),
> (5, 'value_5');
>
> CREATE TABLE merge_source_1 (ID int, value string) STORED AS ORC;
>
> INSERT INTO merge_source_1 VALUES
> (1, 'newvalue_1'),
> (2, 'newvalue_2'),
> (3, 'newvalue_3'),
> (6, 'value_6'),
> (7, 'value_7');
>
> MERGE INTO transactions AS T
> USING merge_source_1 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
>
> CREATE TABLE merge_source_2 (ID int, value string) STORED AS ORC;
>
> INSERT INTO merge_source_2 VALUES
> (1, 'newestvalue_1'),
> (2, 'newestvalue_2'),
> (5, 'newestvalue_5'),
> (7, 'newestvalue_7'),
> (8, 'value_18');
>
> MERGE INTO transactions AS T
> USING merge_source_2 AS S
> ON T.ID = S.ID
> WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET value = S.value
> WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);
>
> ALTER TABLE transactions COMPACT 'MAJOR';
> {noformat}
> The MAJOR compaction then fails with the following error:
> {noformat}
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order of Acid rows detected for the rows: org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e and org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
> 	at org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
> {noformat}
> So the validation doesn't check for the correct row order. The correct order
> is originalTransactionId, bucketProperty, rowId.
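> For reference, the intended comparison order can be illustrated with a short
> Java sketch. This is illustrative only and assumes simplified, hypothetical
> class and field names; it does not mirror the actual
> GenericUDFValidateAcidSortOrder implementation:
> {noformat}
> import java.util.Comparator;
>
> // Hypothetical sketch of the row-order check, not the real Hive UDF code.
> public class AcidSortOrderSketch {
>
>     // The three ROW__ID components that participate in the sort-order check.
>     static class RowIdentifier {
>         final long writeId;       // originalTransactionId
>         final int bucketProperty; // ROW__ID.bucketId
>         final long rowId;         // ROW__ID.rowId
>
>         RowIdentifier(long writeId, int bucketProperty, long rowId) {
>             this.writeId = writeId;
>             this.bucketProperty = bucketProperty;
>             this.rowId = rowId;
>         }
>     }
>
>     // Correct order: originalTransactionId first, then bucketProperty, then rowId.
>     static final Comparator<RowIdentifier> CORRECT_ORDER =
>             Comparator.<RowIdentifier>comparingLong(r -> r.writeId)
>                       .thenComparingInt(r -> r.bucketProperty)
>                       .thenComparingLong(r -> r.rowId);
>
>     // The validation should only flag a row when it compares lower than the
>     // previously seen row under CORRECT_ORDER.
>     static boolean isInOrder(RowIdentifier previous, RowIdentifier current) {
>         return CORRECT_ORDER.compare(previous, current) <= 0;
>     }
> }
> {noformat}
> With a comparator like this, a file containing multiple bucketProperties for
> the same originalTransactionId would still pass validation, which is the
> situation produced by the merge statements above.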