[
https://issues.apache.org/jira/browse/HBASE-29463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duo Zhang resolved HBASE-29463.
-------------------------------
Fix Version/s: 2.7.0
3.0.0-beta-2
2.6.4
2.5.13
Hadoop Flags: Reviewed
Resolution: Fixed
Pushed to all active branches.
Thanks [~haosen chen] for analyzing the issue and [~ndimiduk] for reviewing!
> Bidirectional serial replication will block if a region’s last edit before rs
> crashed was from the peer cluster
> ---------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29463
> URL: https://issues.apache.org/jira/browse/HBASE-29463
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.4.5
> Reporter: haosen chen
> Assignee: Duo Zhang
> Priority: Critical
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.4, 2.5.13
>
> Attachments: image-2025-07-21-14-52-19-057.png,
> image-2025-07-21-14-52-47-751.png
>
>
> For two HBase clusters with both bidirectional replication and serial
> replication enabled, if the last edit a region in cluster A received before
> its RS crashed came from the peer cluster, replication from cluster A to B
> will block. In this situation, the HBase replication system waits until the
> last pushed sequence id reaches the new barrier, but edits received from the
> peer cluster are never pushed, so that point is never reached.
> Suppose Region r1 in Cluster A pushes its last edit (e.g., seqID 58) to
> Cluster B, then receives two additional edits (seqID 59–60) from Cluster B,
> and then the RS crashes. Region r1 is reopened on another RegionServer,
> which sets a new barrier at seqID 61. However, edits 59–60 will never be
> pushed to Cluster B, so the _last pushed sequenceId_ stagnates at 58. As a
> result, the {{SerialReplicationChecker}} will repeatedly fail its checks.
> The new RS keeps printing DEBUG logs like the following:
> 2025-07-14 20:05:53,953 DEBUG
> [RS_OPEN_REGION-regionserver/172.16.0.43:6002-0.replicationSource.wal-reader.172.16.0.43%2C6002%2C1752216296629.172.16.0.43%2C6002%2C1752216296629.regiongroup-1,1]
> regionserver.SerialReplicationChecker: Replication barrier for
> test1/46b4ecbd63d7fbcb16d68e106f904013/30=[#edits: 0 = <>]:
> ReplicationBarrierResult [barriers=[23, 29, 68], state=OPEN,
> parentRegionNames=]
> 2025-07-14 20:05:53,953 DEBUG
> [RS_OPEN_REGION-regionserver/172.16.0.43:6002-0.replicationSource.wal-reader.172.16.0.43%2C6002%2C1752216296629.172.16.0.43%2C6002%2C1752216296629.regiongroup-1,1]
> regionserver.SerialReplicationChecker: Previous range for
> test1/46b4ecbd63d7fbcb16d68e106f904013/30=[#edits: 0 = <>] has not been
> finished yet, give up
> 2025-07-14 20:05:53,953 DEBUG
> [RS_OPEN_REGION-regionserver/172.16.0.43:6002-0.replicationSource.wal-reader.172.16.0.43%2C6002%2C1752216296629.172.16.0.43%2C6002%2C1752216296629.regiongroup-1,1]
> regionserver.SerialReplicationChecker: Can not push
> test1/46b4ecbd63d7fbcb16d68e106f904013/30=[#edits: 0 = <>], wait
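
The blocking condition in the description above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HBase code: `can_push` and `replay` are hypothetical names, the barrier list `[1, 61]` is invented to match the seqID 58/59–60/61 example, and the rule "the previous range is finished once the last pushed sequence id reaches the barrier minus one" is a simplification of what the description calls reaching the new barrier.

```python
from bisect import bisect_right


def can_push(seq_id, barriers, last_pushed):
    """Toy serial check: allow an edit only if the previous barrier range is done.

    Assumption: a range is "done" once last_pushed >= (next barrier - 1).
    """
    idx = bisect_right(barriers, seq_id)  # number of barriers <= seq_id
    if idx <= 1:
        # Before the first barrier or inside the first range: nothing to wait for.
        return True
    # barriers[idx - 1] opens the current range; the previous range ends just before it.
    return last_pushed >= barriers[idx - 1] - 1


def replay(entries, barriers, last_pushed):
    """Walk (seq_id, from_peer) entries and record which local edits get pushed or blocked."""
    pushed, blocked = [], []
    for seq_id, from_peer in entries:
        if from_peer:
            # Edits that originated on the peer are not shipped back,
            # so they never advance last_pushed.
            continue
        if can_push(seq_id, barriers, last_pushed):
            last_pushed = seq_id
            pushed.append(seq_id)
        else:
            blocked.append(seq_id)
    return pushed, blocked, last_pushed


# Scenario from the description: 58 already pushed, 59-60 came from cluster B,
# the region reopened with a barrier at 61, then new local edits 62 and 63 arrive.
stuck = replay([(59, True), (60, True), (62, False), (63, False)], [1, 61], 58)

# Control case: the same sequence ids, but 59-60 were local edits.
healthy = replay([(59, False), (60, False), (62, False)], [1, 61], 58)
```

In the toy model the first scenario never advances past 58, so every edit after the new barrier stays blocked, while the control case drains normally. The real checker also considers parent regions and region state, which this sketch leaves out.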
--
This message was sent by Atlassian Jira
(v8.20.10#820010)