[ 
https://issues.apache.org/jira/browse/HIVE-25257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-25257:
---------------------------------
    Description: 
In the insert query of the query-based MAJOR compaction, there is the following function 
call: "validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, 
ROW__ID.rowId)".
It validates that the rows arrive in the correct order. The validation is 
done by the GenericUDFValidateAcidSortOrder class, which assumes that the rows 
are in increasing order by bucketProperty, originalTransactionId and rowId. 
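For illustration, the generated insert statement has roughly the following shape (a 
sketch only: the result table name and the selected column list are placeholders, 
the real query is built internally by the compactor):
{noformat}
-- Illustrative sketch, not the exact generated query
INSERT INTO compaction_result_tmp  -- hypothetical result table
SELECT validate_acid_sort_order(ROW__ID.writeId, ROW__ID.bucketId, ROW__ID.rowId),
       ROW__ID.writeId, ROW__ID.bucketId, ROW__ID.rowId, id, value
FROM transactions;
{noformat}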

But the rows should actually be ordered by originalTransactionId, 
bucketProperty and rowId, otherwise the delete deltas cannot be applied 
correctly. This is also the order in which the MR MAJOR compaction writes the rows 
and in which the split groups are created for the query-based MAJOR compaction. The 
mismatch doesn't cause any issue as long as there is only one bucketProperty in the 
files, but as soon as there are multiple bucketProperties in the same file, the 
validation fails. This can be reproduced by running multiple merge statements one 
after the other.
For example:
{noformat}
CREATE TABLE transactions (id int,value string) STORED AS ORC TBLPROPERTIES 
('transactional'='true');

INSERT INTO transactions VALUES
(1, 'value_1'),
(2, 'value_2'),
(3, 'value_3'),
(4, 'value_4'),
(5, 'value_5');

CREATE TABLE merge_source_1(ID int,value string) STORED AS ORC;
INSERT INTO merge_source_1 VALUES 
(1, 'newvalue_1'),
(2, 'newvalue_2'),
(3, 'newvalue_3'),
(6, 'value_6'),
(7, 'value_7');

MERGE INTO transactions AS T USING merge_source_1 AS S ON T.ID = S.ID 
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value 
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);

CREATE TABLE merge_source_2(
 ID int,
 value string)
STORED AS ORC;

INSERT INTO merge_source_2 VALUES
(1, 'newestvalue_1'),
(2, 'newestvalue_2'),
(5, 'newestvalue_5'),
(7, 'newestvalue_7'),
(8, 'value_8');

MERGE INTO transactions AS T 
USING merge_source_2 AS S
ON T.ID = S.ID
WHEN MATCHED AND (T.value != S.value AND S.value IS NOT NULL) THEN UPDATE SET 
value = S.value
WHEN NOT MATCHED THEN INSERT VALUES (S.ID, S.value);

ALTER TABLE transactions COMPACT 'MAJOR';
{noformat}
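Before triggering the compaction, the mixed bucketProperties can be observed through 
Hive's virtual columns (a sketch; ROW__ID.bucketId carries the encoded bucketProperty 
and INPUT__FILE__NAME shows which file each row lives in):
{noformat}
-- Sketch: inspect the ACID metadata of each row after the two merges
SELECT INPUT__FILE__NAME, ROW__ID.writeId, ROW__ID.bucketId, ROW__ID.rowId, id, value
FROM transactions
ORDER BY INPUT__FILE__NAME, ROW__ID.writeId, ROW__ID.bucketId, ROW__ID.rowId;
{noformat}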
The MAJOR compaction will fail with the following error:
{noformat}
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Wrong sort order 
of Acid rows detected for the rows: 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@4d3ef25e
 and 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder$WriteIdRowId@1c9df436
        at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFValidateAcidSortOrder.evaluate(GenericUDFValidateAcidSortOrder.java:80)
{noformat}
So the validation checks for the wrong row order; the correct order is 
originalTransactionId, bucketProperty, rowId.
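Expressed as an ORDER BY (illustrative only), the validation should accept rows that 
satisfy:
{noformat}
-- Order currently assumed by the UDF (breaks with mixed bucketProperties):
--   ROW__ID.bucketId, ROW__ID.writeId, ROW__ID.rowId
-- Order the rows actually arrive in (matches the MR MAJOR compaction output):
SELECT id, value
FROM transactions
ORDER BY ROW__ID.writeId, ROW__ID.bucketId, ROW__ID.rowId;
{noformat}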

> Incorrect row order validation for query-based MAJOR compaction
> ---------------------------------------------------------------
>
>                 Key: HIVE-25257
>                 URL: https://issues.apache.org/jira/browse/HIVE-25257
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
