haosen chen created HBASE-29463:
-----------------------------------

             Summary: Bidirectional serial replication will block if a region’s 
last edit before rs crashed was from the peer cluster
                 Key: HBASE-29463
                 URL: https://issues.apache.org/jira/browse/HBASE-29463
             Project: HBase
          Issue Type: Bug
          Components: Replication
    Affects Versions: 2.4.5
            Reporter: haosen chen


When two HBase clusters have bidirectional replication enabled with serial 
replication, replication from cluster A to cluster B blocks if the last edit a 
region in cluster A received before its RegionServer crashed came from the peer 
cluster. In this situation, the HBase replication system waits until the last 
pushed sequence id reaches the new barrier, but edits that originated from the 
peer cluster are never pushed back, so that point is never reached.

For example, suppose Region r1 in Cluster A pushes its last edit (e.g., seqID 
58) to Cluster B, then receives two additional edits (seqID 59–60) from Cluster 
B, and then its RegionServer crashes. Region r1 is reopened on another 
RegionServer, which sets a new barrier at seqID 61. However, edits 59–60 will 
never be pushed to Cluster B, so the _last pushed sequenceId_ stagnates at 58. 
As a result, the {{SerialReplicationChecker}} repeatedly fails its checks.
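
The condition described above can be modelled with a small sketch. This is a 
simplified illustration, not the actual {{SerialReplicationChecker}} code: the 
function name can_push and its parameters are hypothetical, and it assumes the 
check reduces to "the last pushed sequence id must have reached the barrier 
that opens the current range, minus one".

```python
from bisect import bisect_right


def can_push(barriers, last_pushed_seq_id, seq_id):
    """Simplified, hypothetical model of the serial replication check.

    barriers           -- sorted replication barriers of the region
    last_pushed_seq_id -- last sequence id pushed to the peer for this region
    seq_id             -- sequence id of the edit we want to push
    """
    # Find the range that seq_id falls into: the range is opened by the
    # greatest barrier that is <= seq_id.
    idx = bisect_right(barriers, seq_id)
    if idx == 0:
        # The edit precedes every barrier, so nothing has to finish first.
        return True
    range_start = barriers[idx - 1]
    # Assumption: the previous range counts as finished only when every
    # sequence id below the barrier has been pushed.
    return last_pushed_seq_id >= range_start - 1


# Scenario from this report: 58 was pushed, 59-60 came from the peer cluster
# and are never pushed back, and the region reopened with a barrier at 61.
barriers = [1, 61]
print(can_push(barriers, 58, 61))  # peer-originated edits leave a gap
print(can_push(barriers, 60, 61))  # what would happen if 59-60 were pushed
```

Under this model the first call returns False no matter how often it is 
retried, because nothing will ever advance the last pushed sequence id from 58 
to 60. The second call returns True, which is the case where edits 59–60 were 
local edits and had been pushed normally.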

The new RegionServer keeps printing the following DEBUG log:
2025-07-14 20:05:53,953 DEBUG 
[RS_OPEN_REGION-regionserver/172.16.0.43:6002-0.replicationSource.wal-reader.172.16.0.43%2C6002%2C1752216296629.172.16.0.43%2C6002%2C1752216296629.regiongroup-1,1]
 regionserver.SerialReplicationChecker: Replication barrier for 
test1/46b4ecbd63d7fbcb16d68e106f904013/30=[#edits: 0 = <>]: 
ReplicationBarrierResult [barriers=[23, 29, 68], state=OPEN, parentRegionNames=]
2025-07-14 20:05:53,953 DEBUG 
[RS_OPEN_REGION-regionserver/172.16.0.43:6002-0.replicationSource.wal-reader.172.16.0.43%2C6002%2C1752216296629.172.16.0.43%2C6002%2C1752216296629.regiongroup-1,1]
 regionserver.SerialReplicationChecker: Previous range for 
test1/46b4ecbd63d7fbcb16d68e106f904013/30=[#edits: 0 = <>] has not been 
finished yet, give up
2025-07-14 20:05:53,953 DEBUG 
[RS_OPEN_REGION-regionserver/172.16.0.43:6002-0.replicationSource.wal-reader.172.16.0.43%2C6002%2C1752216296629.172.16.0.43%2C6002%2C1752216296629.regiongroup-1,1]
 regionserver.SerialReplicationChecker: Can not push 
test1/46b4ecbd63d7fbcb16d68e106f904013/30=[#edits: 0 = <>], wait



