Harshal Patel created HIVE-28975:
------------------------------------
Summary: [HiveAcidReplication] Remove dangling txns from Target
side post incremental replication
Key: HIVE-28975
URL: https://issues.apache.org/jira/browse/HIVE-28975
Project: Hive
Issue Type: Improvement
Components: repl
Reporter: Harshal Patel
Assignee: Harshal Patel
*Context and Problem Statement:*
Currently, due to certain inconsistencies on the Hive side, customers are
frequently encountering the repl_incompatible error, triggered by different
underlying issues.
* *Current Issue:* There are missing entries in the txn_write_notification_log
table for TRUNCATE operations. This causes problems when the Hive configuration
property hive.repl.filter.transactions is set to true.
To improve resiliency from the replication side, we propose a mechanism to
clean up dangling transaction entries on the Disaster Recovery (DR) cluster
after the incremental load completes.
*Proposed Solution:*
We introduce a mechanism to capture and reconcile the state of open
transactions during the replication process.
h3. *Steps:*
# *Capture Initial Open Transactions:*
* At the beginning of the incremental dump, capture the list of open
transactions.
* For example, this initial list might be: 1, 2, 3.
# *Proceed with Normal Dump Process:*
* While the dump is in progress, some transactions may complete, and new ones
may start.
* For instance, suppose transaction 1 completes and transaction 4 starts.
# *Capture Final Open Transactions:*
* After the dump completes, capture the list of open transactions again.
* This list might now be: 2, 3, 4.
* Append the newly started transaction (4 in this case) to the initial list,
giving 1, 2, 3, 4, and persist the combined list in a file.
# *During Load on the DR Cluster:*
* At this point, the load sees transactions 1, 2, 3, and 4 as open transactions
from the source.
* After the load process completes, retrieve the transaction list from the
repl_txn_map for the respective database.
# *Clean Dangling Transactions:*
* Abort the transactions on the DR cluster that are *not* present in the final
list of transactions captured in step 3.
* Conceptually: remove from repl_txn_map every entry that is not in the list of
open transactions from the source.
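The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Hive's implementation: the function names build_source_open_txn_list and find_dangling_txns are hypothetical, transaction IDs are plain integers, and repl_txn_map is modeled as a dictionary from source transaction ID to target transaction ID.

```python
def build_source_open_txn_list(open_at_start, open_at_end):
    """Steps 1-3: merge the snapshots taken before and after the dump."""
    combined = list(open_at_start)
    for txn_id in open_at_end:
        if txn_id not in combined:
            combined.append(txn_id)  # transaction opened while the dump ran
    return combined


def find_dangling_txns(repl_txn_map, source_open_txns):
    """Step 5: entries whose source txn is absent from the persisted list."""
    keep = set(source_open_txns)
    return {src: tgt for src, tgt in repl_txn_map.items() if src not in keep}


# Snapshots from the example in the steps.
source_open = build_source_open_txn_list([1, 2, 3], [2, 3, 4])

# Assumed state of repl_txn_map on the DR cluster; source txn 7 is an
# invented leftover entry used only to show what cleanup would target.
repl_txn_map = {1: 101, 2: 102, 3: 103, 4: 104, 7: 107}
dangling = find_dangling_txns(repl_txn_map, source_open)

print(source_open)  # [1, 2, 3, 4]
print(dangling)     # {7: 107}
```

In this sketch only the entry for source transaction 7 would be aborted and removed; transactions 1 through 4 are left untouched because they appear in the persisted list.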
h3. *Rationale Behind Key Steps:*
*Why is Step 1 Important?*
If the initial list of open transactions is not captured, the cleanup has no
record of transactions that were open when the dump started but were closed
while it was running. For example, transaction 1 was open when the dump
started, so it remains open on the DR cluster after replication, even though it
was closed on the source while the dump was running. Skipping this step would
therefore result in valid transactions being incorrectly aborted during cleanup
(step 5).
*Why is Step 3 Important?*
If a transaction (e.g., transaction 4) is opened while the dump is running (step 2) and is
replicated as part of the dump, it must be included in the list. Otherwise, it
would be incorrectly aborted during the cleanup phase (step 5).
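The two rationale points above can be checked with a small worked example. This is only a sketch: the helper name wrongly_aborted is hypothetical, and the target-side state is taken from the example in step 4, where transactions 1, 2, 3, and 4 are open on the DR cluster.

```python
def wrongly_aborted(target_open_txns, keep_list):
    """Return the transactions cleanup would abort for a given keep-list."""
    return sorted(set(target_open_txns) - set(keep_list))


open_at_start = [1, 2, 3]    # step 1 snapshot
open_at_end = [2, 3, 4]      # step 3 snapshot
target_open = [1, 2, 3, 4]   # open on the DR cluster after load (step 4)

# Both snapshots combined: nothing valid is aborted.
with_both = wrongly_aborted(target_open, set(open_at_start) | set(open_at_end))

# Step 1 skipped: only the final snapshot is known, so txn 1 is aborted.
without_step_1 = wrongly_aborted(target_open, open_at_end)

# Step 3 skipped: only the initial snapshot is known, so txn 4 is aborted.
without_step_3 = wrongly_aborted(target_open, open_at_start)

print(with_both, without_step_1, without_step_3)  # [] [1] [4]
```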