[
https://issues.apache.org/jira/browse/HIVE-29572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marta Kuczora updated HIVE-29572:
---------------------------------
Description:
It can happen that a compaction is marked as finished and gets into the "ready
for cleaning" state, but the compaction txn stays open. When the timeout is
reached, the txn gets aborted.
With min.history.level, a compaction like this can block the cleaning for all
subsequent compactions.
This is what happens:
* The cleaner picks compaction 1 and finds nothing to delete, because it doesn’t
find a valid base (which is correct, as this cleaner should only see what
compaction 1 did and its txn is not committed).
* It deletes nothing but still finds obsolete deltas (because at this point the
txn range is cleared and the base is found), so it puts the compaction back in
the queue in the ‘ready-for-cleaning’ state.
* The other compactions are not fetched by the cleaner.
* The problem is that even after the txn of compaction 1 is aborted, the same
thing happens, so the cleaner is blocked forever.
To avoid this blocking, the cleaner should check the state of the compaction
txn and if it is already aborted, mark the compaction as failed and delete
nothing.
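
The proposed check could look roughly like the sketch below. It is only an
illustration of the idea described above, not the actual Hive change: every
class, enum, and method name here (CleanerSketch, TxnState, CompactionState,
Compaction, clean) is hypothetical, and the handling of open and committed
txns is an assumption added to make the example complete.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names are Hive's real classes or methods.
class CleanerSketch {

    enum TxnState { OPEN, COMMITTED, ABORTED }

    enum CompactionState { READY_FOR_CLEANING, FAILED, SUCCEEDED }

    static final class Compaction {
        final long id;
        final TxnState txnState;
        CompactionState state = CompactionState.READY_FOR_CLEANING;

        Compaction(long id, TxnState txnState) {
            this.id = id;
            this.txnState = txnState;
        }
    }

    /**
     * Processes one compaction picked from the queue and returns the number
     * of files deleted.
     */
    static int clean(Compaction c, List<String> obsoleteFiles) {
        if (c.txnState == TxnState.ABORTED) {
            // The fix described above: mark the compaction as failed and
            // delete nothing, so it is not re-queued and cannot block others.
            c.state = CompactionState.FAILED;
            return 0;
        }
        if (c.txnState == TxnState.OPEN) {
            // Assumption: an open txn leaves the compaction in
            // ready-for-cleaning so that it is retried later.
            return 0;
        }
        // Assumption: a committed txn means the obsolete files can be removed.
        int deleted = obsoleteFiles.size();
        obsoleteFiles.clear();
        c.state = CompactionState.SUCCEEDED;
        return deleted;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>(Arrays.asList("delta_1_1", "delta_2_2"));
        Compaction aborted = new Compaction(1L, TxnState.ABORTED);
        int deleted = clean(aborted, files);
        System.out.println(aborted.state + ", deleted=" + deleted);
    }
}
```

With this check, a compaction whose txn was aborted leaves the
ready-for-cleaning state on the first cleaner cycle that sees it, instead of
being put back in the queue on every cycle.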
was: We ran into some situations where a compaction was marked as finished
and was in the ready-for-cleaning state, but the compaction txn was still open.
This inconsistency led to data loss. There were some improvements in the
cleaner to avoid these situations, but we should consider checking the txn
state when the cleaner selects a compaction to clean.
> ACID Compaction: Cleaner should mark a compaction failed when its txn is
> aborted
> --------------------------------------------------------------------------------
>
> Key: HIVE-29572
> URL: https://issues.apache.org/jira/browse/HIVE-29572
> Project: Hive
> Issue Type: Task
> Reporter: Marta Kuczora
> Assignee: Marta Kuczora
> Priority: Major
> Labels: pull-request-available
>
> It can happen that a compaction is marked as finished and gets into the "ready
> for cleaning" state, but the compaction txn stays open. When the timeout is
> reached, the txn gets aborted.
> With min.history.level, a compaction like this can block the cleaning for all
> subsequent compactions.
> This is what happens:
> * The cleaner picks compaction 1 and finds nothing to delete, because it doesn’t
> find a valid base (which is correct, as this cleaner should only see what
> compaction 1 did and its txn is not committed).
> * It deletes nothing but still finds obsolete deltas (because at this point the
> txn range is cleared and the base is found), so it puts the compaction back in
> the queue in the ‘ready-for-cleaning’ state.
> * The other compactions are not fetched by the cleaner.
> * The problem is that even after the txn of compaction 1 is aborted, the same
> thing happens, so the cleaner is blocked forever.
> To avoid this blocking, the cleaner should check the state of the compaction
> txn and if it is already aborted, mark the compaction as failed and delete
> nothing.