[jira] [Commented] (HIVE-29536) Improve the rebalance compaction tests in TestCrudCompactorOnTez

Thomas Rebele (Jira) Tue, 12 May 2026 08:21:40 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-29536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080335#comment-18080335
 ]


Thomas Rebele commented on HIVE-29536:
--------------------------------------

[~InvisibleProgrammer] and I investigated this case and discussed some of 
the details. Here is how I understand our findings:

It was quite interesting to discover that the behavior of Tez depends, 
indirectly, on the version string of orc-core:
 # The ORC files are written in the preparation step, and the ORC footer 
contains the version string. Depending on how well the version string 
compresses, the size of the ORC files changes.
 # TezSplitGrouper calculates the total size of the splits and divides it by 
the number of desired partitions, resulting in {{lengthPerGroup}}.
 # It iterates over the splits and puts them into the same group as long as 
the group's size is {{<= lengthPerGroup}}.
 # If a file is slightly bigger, adding another split may exceed that limit, 
so the split is not included in the group. The assignment of splits to groups 
may therefore change depending on the ORC version string.
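
The steps above can be illustrated with a small sketch. This is a simplified 
model of size-based grouping, not the actual TezSplitGrouper code: the class 
name {{SplitGroupingSketch}}, the method {{group}}, and the byte sizes are all 
hypothetical, and the real grouper takes more factors into account. It only 
shows how a few extra bytes in one file can move a group boundary.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of size-based split grouping; NOT the actual TezSplitGrouper code. */
class SplitGroupingSketch {

    /** A split joins the current group while the group's size stays <= lengthPerGroup. */
    static List<List<Long>> group(long[] splitSizes, int desiredGroups) {
        long total = 0;
        for (long size : splitSizes) {
            total += size;
        }
        long lengthPerGroup = total / desiredGroups;

        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : splitSizes) {
            // Close the current group if adding this split would exceed the limit.
            if (!current.isEmpty() && currentSize + size > lengthPerGroup) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical file sizes in bytes; in the second case one file is 5 bytes bigger.
        long[] before = {100, 100, 100, 100};
        long[] after = {100, 105, 100, 100};
        System.out.println(group(before, 2)); // [[100, 100], [100, 100]]
        System.out.println(group(after, 2));  // [[100], [105], [100, 100]]
    }
}
```

In this model, the 5 extra bytes raise {{lengthPerGroup}} from 200 to 202, but 
the first two splits now add up to 205, so they no longer fit into one group.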

TezSplitGrouper is applied when executing a simple {{SELECT ...}}, so a 
different grouping changes the order in which the rows are returned; only the 
order imposed by {{ORDER BY}} is guaranteed.
 
The test 
TestCrudCompactorOnTez#testRebalanceCompactionOfNotPartitionedImplicitlyBucketedTableWithOrder
 [sets an explicit order for the 
compaction|https://github.com/apache/hive/blob/ba43ea33acf4cba437b4625387d5ed71a8acdf7e/itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCrudCompactorOnTez.java#L126]:
{code:java}
executeStatementOnDriver("ALTER TABLE " + tableName + " COMPACT 'rebalance' ORDER BY b DESC", driver);{code}
The corresponding compaction query is:
{code:sql}
INSERT overwrite table default.tmp_compactor_rebalance_test_1778598114174
select 0, t2.writeId, t2.rowId DIV CEIL(numRows / 4), t2.rowId, t2.writeId, 
t2.data
from
  (select count(ROW__ID.writeId) over() as numRows,
          MAX(ROW__ID.writeId) over() as writeId,
          row_number() OVER (ORDER BY b DESC) - 1 AS rowId,
          NAMED_STRUCT('a', `a`, 'b', `b`) as data
   from default.rebalance_test ORDER BY b DESC) t2{code}
It contains an {{ORDER BY b DESC}} (because of the same clause in the {{ALTER 
TABLE ... COMPACT 'rebalance' ORDER BY b DESC}} statement). So the result is 
only guaranteed to be ordered by column {{b}}, not by any other column; rows 
with equal {{b}} values may appear in any order.

The test assertion relies on a row order that is not guaranteed. The order of 
the records could be made stable by changing the statement to 
{{ORDER BY b DESC, a DESC}} instead of ordering by column {{b}} alone.
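
The effect of the extra sort key can be sketched as follows. This is a model, 
not Hive code: rows are hypothetical {{(a, b)}} pairs, the two lists stand for 
two different arrival orders caused by different split grouping, and a stable 
in-memory sort stands in for the query's ordering (Hive itself gives no 
guarantee about the order of ties). It also assumes that the combination of 
{{a}} and {{b}} is unique in the test data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Models how ties on column b behave with and without a second sort key. */
class TieBreakSketch {

    // Each row is an int[] {a, b}.
    static final Comparator<int[]> BY_B_DESC =
        Comparator.comparingInt((int[] r) -> r[1]).reversed();
    static final Comparator<int[]> BY_B_DESC_A_DESC =
        BY_B_DESC.thenComparing(Comparator.comparingInt((int[] r) -> r[0]).reversed());

    /** Two hypothetical arrival orders of the same rows. */
    static List<int[]> arrivalA() {
        return Arrays.asList(new int[]{1, 5}, new int[]{2, 5}, new int[]{3, 4});
    }

    static List<int[]> arrivalB() {
        return Arrays.asList(new int[]{2, 5}, new int[]{1, 5}, new int[]{3, 4});
    }

    /** Sorts a copy of the rows and renders them as one string for easy comparison. */
    static String sorted(List<int[]> rows, Comparator<int[]> order) {
        List<int[]> copy = new ArrayList<>(rows);
        copy.sort(order); // List.sort is stable, so ties keep their arrival order
        StringBuilder sb = new StringBuilder();
        for (int[] row : copy) {
            sb.append(Arrays.toString(row));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Ordering by b only: the result depends on the arrival order.
        System.out.println(sorted(arrivalA(), BY_B_DESC)); // [1, 5][2, 5][3, 4]
        System.out.println(sorted(arrivalB(), BY_B_DESC)); // [2, 5][1, 5][3, 4]
        // Ordering by b, then a: both arrival orders give the same result.
        System.out.println(sorted(arrivalA(), BY_B_DESC_A_DESC)); // [2, 5][1, 5][3, 4]
        System.out.println(sorted(arrivalB(), BY_B_DESC_A_DESC)); // [2, 5][1, 5][3, 4]
    }
}
```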

> Improve the rebalance compaction tests in TestCrudCompactorOnTez
> ----------------------------------------------------------------
>
>                 Key: HIVE-29536
>                 URL: https://issues.apache.org/jira/browse/HIVE-29536
>             Project: Hive
>          Issue Type: Task
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>
> Check the asserts in the rebalance compaction tests in TestCrudCompactorOnTez.
> Currently the rows are checked against hard-coded strings. This caused a 
> failure during the ORC version upgrade because the order of the data 
> changed. We should think about a more robust way of validating the file 
> content after the rebalance compaction so it wouldn't fail again when the 
> ORC version is upgraded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
