Tomas created HBASE-29499:
-----------------------------
Summary: Serial replication stuck pushing entry with seqId equal
to barrier
Key: HBASE-29499
URL: https://issues.apache.org/jira/browse/HBASE-29499
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.6.2
Reporter: Tomas
HBase version: 2.6.2-hadoop3, revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17
h1. Problem
On several test HBase clusters with serial replication enabled, where
regionservers frequently crash or shut down non-gracefully, we found that the
WAL can contain entries whose seqId equals a barrier in the meta table, e.g.
barriers for region X = [2, 5, 6], entry for region X with seqId = 6 (equal to
the barrier with value 6), and pushedSeqId=4 (seqId-2).
When {_}SerialReplicationChecker{_} checks whether such an entry can be pushed,
_canPush_ returns false, causing replication to block indefinitely.
Example 1:
{{2025-07-22T16:12:06,070 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
regionserver.SerialReplicationChecker: Replication barrier for
test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/{*}39{*}=[#edits: 0 = <>]:
ReplicationBarrierResult [{*}barriers=[9, 17, 25, 28, 31, 34, 38, 39{*}],
state=OPEN, parentRegionNames=]}}
{{2025-07-22T16:12:06,072 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
regionserver.SerialReplicationChecker: *Previous range for
test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>] has not been
finished yet, give up*}}
{{2025-07-22T16:12:06,072 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/home-host-1:16020-0.replicationSource,peer_1-home-host-1,16020,1753116284068.replicationSource.wal-reader.home-host-1%2C16020%2C1753116284068,peer_1-home-host-1,16020,1753116284068]
regionserver.SerialReplicationChecker: Can not push
test_table/eb9d5e0c9147f04e0ef1296c959c2ae9/39=[#edits: 0 = <>], wait}}
* barriers=[9, 17, 25, 28, 31, 34, 38, 39]
* Entry is for HBASE::REGION_EVENT::REGION_OPEN with seqid=39 from *not the
last* range (replication queue is claimed).
* pushedSeqId=37
The previous range is calculated as 39 instead of 38, so the check
pushedSeqId >= 39-1 (i.e. 37 >= 38) is false.
See
[https://docs.google.com/document/d/1iB2xopSoC2IRHR8wmbGX5cmaS0RKsdFJiKeJ7EyLzeg]
for more supporting information (zookeeper state, WALs).
Example 2:
{{2025-08-05T07:43:53,198 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
{}] regionserver.SerialReplicationChecker: Replication barrier for
aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/{*}650974464{*}=[#edits: 0 = <>]:
ReplicationBarrierResult [barriers=[649436971, {*}650974464{*}, 650990494,
651037843, 651092522, 651096754, 651118516, 651147941, 651173589], state=OPEN,
parentRegionNames=]}}
{{2025-08-05T07:43:53,199 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
{}] regionserver.SerialReplicationChecker: *Previous range for
aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>] has not
been finished yet, give up*}}
{{2025-08-05T07:43:53,199 DEBUG
[RS_CLAIM_REPLICATION_QUEUE-regionserver/regionserver-0:16020-0.replicationSource,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843.replicationSource.wal-reader.regionserver-3.hbase.hbase.svc.cluster.local%2C16020%2C1754081460813,hbase_analytics_1-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754081460813-regionserver-0.hbase.hbase.svc.cluster.local,16020,1754258850214-regionserver-3.hbase.hbase.svc.cluster.local,16020,1754343051138-regionserver-4.hbase.hbase.svc.cluster.local,16020,1754345729428-regionserver-1.hbase.hbase.svc.cluster.local,16020,1754367453843
{}] regionserver.SerialReplicationChecker: Can not push
aeris_v2/cfc70a1a9c3a8c459dc4b79ece6d1ebd/650974464=[#edits: 0 = <>], wait}}
* barriers=[649436971, 650974464, 650990494, …]
* Entry has seqid=650974464 and is from *not the last* range (replication queue
is claimed).
* pushedSeqId=650974462
The previous range is calculated as 650974464 instead of 649436971, so the
check pushedSeqId >= 650974464-1 (i.e. 650974462 >= 650974463) is false.
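The arithmetic in both examples can be reproduced with a small standalone
sketch. This is not the HBase source: the class and method names below are
hypothetical, and it assumes the checker locates the entry among the barriers
with a binary search, treats an exact hit as belonging to the range that
starts at that barrier, and considers the previous range finished only when
pushedSeqId >= barrier-1. Under those assumptions it yields the same values
as the logs above.

```java
import java.util.Arrays;

// Simplified, hypothetical model of the "previous range finished" check.
// It only mirrors the arithmetic described in this report.
public class BarrierCheckSketch {

    // Returns the barrier that the previous-range check compares against,
    // or -1 if the entry is before the second barrier (no previous range).
    static long previousRangeBarrier(long[] barriers, long seqId) {
        int index = Arrays.binarySearch(barriers, seqId);
        if (index < 0) {
            index = -index - 1; // miss: insertion point
        } else {
            index++;            // hit: entry belongs to the range starting here
        }
        if (index <= 1) {
            return -1;          // before the first barrier or in the first range
        }
        return barriers[index - 1];
    }

    // Assumed rule from the report: finished iff pushedSeqId >= barrier - 1.
    static boolean previousRangeFinished(long[] barriers, long seqId, long pushedSeqId) {
        long barrier = previousRangeBarrier(barriers, seqId);
        return barrier < 0 || pushedSeqId >= barrier - 1;
    }

    public static void main(String[] args) {
        // Example 1: seqId equals the barrier 39, pushedSeqId=37.
        long[] b1 = {9, 17, 25, 28, 31, 34, 38, 39};
        System.out.println(previousRangeBarrier(b1, 39));      // 39
        System.out.println(previousRangeFinished(b1, 39, 37)); // false

        // Example 2: seqId equals the barrier 650974464, pushedSeqId=650974462.
        long[] b2 = {649436971L, 650974464L, 650990494L, 651037843L, 651092522L,
                     651096754L, 651118516L, 651147941L, 651173589L};
        System.out.println(previousRangeBarrier(b2, 650974464L));                  // 650974464
        System.out.println(previousRangeFinished(b2, 650974464L, 650974462L));     // false
    }
}
```

In this model an entry with seqId 38 and the same pushedSeqId=37 would pass
the check (37 >= 38-1), which illustrates why only the entry whose seqId
equals the barrier gets stuck.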
h1. Impact
Replication is blocked indefinitely for regions that contain the problematic
entry.
Entries with higher seqId than the problematic entry cannot be replicated due
to previous range(s) not being finished yet.
The _sizeoflogqueue_ metric grows indefinitely as data is written to the
region(s) and WALs are rolled.
h1. Workarounds
N/A.
Turn off serial mode and replicate non-serially, OR remove and re-add the peer
to restart replication (this leaves a gap in the replicated data).