[
https://issues.apache.org/jira/browse/HIVE-29536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18080335#comment-18080335
]
Thomas Rebele commented on HIVE-29536:
--------------------------------------
[~InvisibleProgrammer] and I investigated this case and discussed some of
the details. Here is how I understand our findings:
Interestingly, the behavior of Tez turned out to depend on the version string
of orc-core:
# The ORC files are written in the preparation step, and the ORC footer
contains the version string. Depending on how well the version string
compresses, the size of the ORC files changes.
# TezSplitGrouper calculates the total size and divides it by the number of
desired partitions, resulting in {{lengthPerGroup}}.
# It iterates over the splits and puts them into the same group as long as
the group's size is {{<= lengthPerGroup}}.
# If a file is slightly bigger, adding another split can exceed that limit,
so the split is not included in the group. As a result, the assignment of
splits to groups may change depending on the ORC version string.
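The steps above can be illustrated with a minimal sketch. It is a simplified model of the greedy grouping as described in this comment, not the actual TezSplitGrouper code; the class name, the method name and the split sizes are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitGroupingSketch {

    // Simplified model of the grouping described above; NOT the real TezSplitGrouper.
    // Returns the indexes of the splits assigned to each group.
    static List<List<Integer>> group(long[] splitSizes, int desiredGroups) {
        long total = 0;
        for (long size : splitSizes) {
            total += size;
        }
        long lengthPerGroup = total / desiredGroups;

        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < splitSizes.length; i++) {
            // Close the current group if adding this split would exceed the limit.
            if (!current.isEmpty() && currentSize + splitSizes[i] > lengthPerGroup) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(i);
            currentSize += splitSizes[i];
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: four equally sized files.
        System.out.println(group(new long[] {100, 100, 100, 100}, 2));
        // Same files, but one of them is a single byte larger.
        System.out.println(group(new long[] {100, 101, 100, 100}, 2));
    }
}
```

In this model the first call groups the splits as {{[[0, 1], [2, 3]]}}, while the second one yields {{[[0], [1], [2, 3]]}}: a one-byte difference in a single file is enough to change the assignment, and therefore the order in which rows reach the reader.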
The TezSplitGrouper is applied when executing a simple {{SELECT ...}}, so the
order of the rows may change; only the order imposed by {{ORDER BY}} is
guaranteed.
The test
TestCrudCompactorOnTez#testRebalanceCompactionOfNotPartitionedImplicitlyBucketedTableWithOrder
[sets an explicit order for the
compaction|https://github.com/apache/hive/blob/ba43ea33acf4cba437b4625387d5ed71a8acdf7e/itests/hive-unit/src/test/java/org/apache/hadoop/hive/ql/txn/compactor/TestCrudCompactorOnTez.java#L126]:
{code:java}
executeStatementOnDriver("ALTER TABLE " + tableName + " COMPACT 'rebalance' ORDER BY b DESC", driver);{code}
The corresponding compaction query is
{code:sql}
INSERT overwrite table default.tmp_compactor_rebalance_test_1778598114174
select 0, t2.writeId, t2.rowId DIV CEIL(numRows / 4), t2.rowId, t2.writeId,
t2.data
from
(select count(ROW__ID.writeId) over() as numRows,
MAX(ROW__ID.writeId) over() as writeId,
row_number() OVER (ORDER BY b DESC) - 1 AS rowId,
NAMED_STRUCT('a', `a`, 'b', `b`) as data
from default.rebalance_test ORDER BY b DESC) t2{code}
It contains an {{ORDER BY b DESC}} (because of the same clause in the {{ALTER
TABLE ... COMPACT 'rebalance' ORDER BY b DESC}} statement). So it is only
guaranteed that the result is ordered by column {{b}}, not by any other column.
The test assertion assumes a fixed order among rows with the same value of
{{b}}, and that assumption does not hold. The order of the records could be
made stable by changing the statement to {{ORDER BY b DESC, a DESC}} instead
of ordering by column {{b}} alone.
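The effect of the tie-breaker can be reproduced outside of Hive with a small sketch. The {{Row}} class and the sample values are hypothetical, and the two input lists stand in for two different split groupings that deliver the same rows in a different order. The sketch assumes that the pairs {{(a, b)}} are distinct; otherwise even the two-column order would leave ties.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OrderByTieSketch {

    // Hypothetical row type with the two columns of the test table.
    static final class Row {
        final int a;
        final int b;

        Row(int a, int b) {
            this.a = a;
            this.b = b;
        }

        @Override
        public String toString() {
            return "(a=" + a + ", b=" + b + ")";
        }
    }

    // Models ORDER BY b DESC: rows with equal b are considered ties.
    static final Comparator<Row> BY_B_DESC =
        Comparator.comparingInt((Row r) -> r.b).reversed();

    // Models ORDER BY b DESC, a DESC: ties on b are broken by a.
    static final Comparator<Row> BY_B_DESC_A_DESC =
        BY_B_DESC.thenComparing(Comparator.comparingInt((Row r) -> r.a).reversed());

    static List<Row> sorted(List<Row> rows, Comparator<Row> order) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(order); // List.sort is stable, so ties keep their arrival order
        return copy;
    }

    public static void main(String[] args) {
        // The same rows arriving in two different orders.
        List<Row> arrival1 = Arrays.asList(new Row(1, 5), new Row(2, 5), new Row(3, 4));
        List<Row> arrival2 = Arrays.asList(new Row(2, 5), new Row(1, 5), new Row(3, 4));

        System.out.println(sorted(arrival1, BY_B_DESC));
        System.out.println(sorted(arrival2, BY_B_DESC));
        System.out.println(sorted(arrival1, BY_B_DESC_A_DESC));
        System.out.println(sorted(arrival2, BY_B_DESC_A_DESC));
    }
}
```

With {{BY_B_DESC}} the two rows having {{b = 5}} come out in whatever order they arrived, so the two results differ. With {{BY_B_DESC_A_DESC}} both arrival orders produce the same result, which is what a test comparing against hard coded rows needs.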
> Improve the rebalance compaction tests in TestCrudCompactorOnTez
> ----------------------------------------------------------------
>
> Key: HIVE-29536
> URL: https://issues.apache.org/jira/browse/HIVE-29536
> Project: Hive
> Issue Type: Task
> Reporter: Marta Kuczora
> Assignee: Marta Kuczora
> Priority: Major
>
> Check the asserts in the rebalance compaction tests in TestCrudCompactorOnTez.
> Currently the rows are checked against hard coded strings. This caused
> failure during the ORC version upgrade because the order of the data was
> changed. We should think about a more robust way of validating the file
> content after the rebalance compaction so it wouldn't fail again when the ORC
> version is upgraded.