[
https://issues.apache.org/jira/browse/HBASE-29501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Duo Zhang resolved HBASE-29501.
-------------------------------
Fix Version/s: 2.7.0, 3.0.0-beta-2, 2.6.5
Hadoop Flags: Reviewed
Resolution: Fixed
Pushed to branch-2.6+.
Thanks [~tomasb] for contributing!
> IOException in SerialReplicationChecker.canPush causes entries to be pushed
> out of order
> ----------------------------------------------------------------------------------------
>
> Key: HBASE-29501
> URL: https://issues.apache.org/jira/browse/HBASE-29501
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 2.6.2
> Reporter: Tomas
> Assignee: Tomas
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.7.0, 3.0.0-beta-2, 2.6.5
>
>
> HBase version: 2.6.2-hadoop3,
> revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17
> h1. Problem
> In several HBase test clusters with serial replication enabled, we observed
> entries with a higher sequence ID being pushed before entries with a lower
> sequence ID when _SerialReplicationChecker.canPush_ throws an
> _IOException_.
> The exception is caught in
> _SerialReplicationSourceWALReader.readWALEntries_. When handling the
> exception, instead of breaking out of the surrounding for loop, the code may
> fall through, push the entry, and record its sequence ID in ZooKeeper:
>
> {code:java}
> try {
>   if (!checker.canPush(entry, firstCellInEntryBeforeFiltering)) {
>     if (batch.getLastWalPosition() > positionBefore) {
>       // we have something that can push, break
>       break;
>     } else {
>       checker.waitUntilCanPush(entry, firstCellInEntryBeforeFiltering);
>     }
>   }
> } catch (IOException e) {
>   LOG.warn("failed to check whether we can push the WAL entries", e);
>   if (batch.getLastWalPosition() > positionBefore) {
>     // we have something that can push, break
>     break;
>   }
>   sleepMultiplier = sleep(sleepMultiplier);
> }
> // <--- continue here after exception is caught
> // arrive here means we can push the entry, record the last sequence id
> batch.setLastSeqId(Bytes.toString(entry.getKey().getEncodedRegionName()),
>   entry.getKey().getSequenceId());
> // actually remove the entry.
> removeEntryFromStream(entryStream, batch);
> if (addEntryToBatch(batch, entry)) {
>   break;
> }
> {code}
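For illustration only, the control flow at issue can be modelled outside HBase. The sketch below is not the committed patch; `SerialPushSketch`, `Checker`, `FlakyChecker`, `readBatch`, and `retryBudget` are hypothetical names, and sequence IDs are reduced to plain `long` values. It shows one way to make sure an entry is added to the batch only after a check has actually succeeded: on an `IOException` the loop either ships what is already batched or retries the check, and never falls through.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SerialPushSketch {
  /** Hypothetical stand-in for SerialReplicationChecker.canPush. */
  interface Checker {
    boolean canPush(long seqId) throws IOException;
  }

  /** Throws IOException for the first `failures` calls, then allows every push. */
  static final class FlakyChecker implements Checker {
    private int failures;

    FlakyChecker(int failures) {
      this.failures = failures;
    }

    @Override
    public boolean canPush(long seqId) throws IOException {
      if (failures > 0) {
        failures--;
        throw new IOException("connection is closed");
      }
      return true;
    }
  }

  /**
   * Builds a batch of sequence IDs in order. An entry is added only after
   * canPush returned true for it; a failed check never falls through.
   * retryBudget bounds the total number of retries so the sketch terminates
   * (an assumption of this sketch, not HBase behavior).
   */
  static List<Long> readBatch(long[] seqIds, Checker checker, int retryBudget) {
    List<Long> batch = new ArrayList<>();
    int retriesLeft = retryBudget;
    for (long seqId : seqIds) {
      boolean checked = false;
      while (!checked) {
        try {
          if (!checker.canPush(seqId)) {
            // Simplification: the real reader would wait until it can push.
            return batch;
          }
          checked = true;
        } catch (IOException e) {
          if (!batch.isEmpty()) {
            // We have something that can be pushed: ship it, skip this entry.
            return batch;
          }
          if (retriesLeft-- <= 0) {
            return batch;
          }
          // The real reader would sleep with backoff here before retrying.
        }
      }
      // Reached only after a successful check for this entry.
      batch.add(seqId);
    }
    return batch;
  }
}
```

With a checker that throws for sequence ID 2 after sequence ID 1 is already batched, `readBatch` returns only `[1]`, so entry 2 is never recorded without a successful check.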
>
> h2. IOException Example 1)
> The RegionServer is terminating, which causes _java.io.IOException:
> connection is closed_ when scanning the meta table for barriers. The RS
> shutdown may race with the shipper finishing replication of the entry:
>
> {code:java}
> 2025-08-01T18:15:46,477 WARN
> [regionserver/home-host-1:16020.replicationSource.wal-reader.home-host-1%2C16020%2C1754068134363,peer_2]
> regionserver.SerialReplicationSourceWALReader: failed to check whether we
> can push the WAL entries
> java.io.IOException: connection is closed
> at
> org.apache.hadoop.hbase.MetaTableAccessor.getMetaHTable(MetaTableAccessor.java:236)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2041)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.waitUntilCanPush(SerialReplicationChecker.java:268)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:89)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> {code}
>
> h2. IOException Example 2)
> A timeout occurs while reading barriers from the hbase:meta table:
> {code:java}
> 2025-08-06T11:42:10,495 WARN
> [regionserver/home-host-1:16020.replicationSource,peer_1.replicationSource.wal-reader.home-host-1%2C16020%2C1754475014225,peer_1]
> regionserver.SerialReplicationSourceWALReader: failed to check whether we
> can push the WAL entries
> java.io.IOException: Failed to get result within timeout, timeout=60000ms
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:250)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:53)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:206)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:281)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:450)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:324)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:622)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2043)
> ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:262)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:84)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> at
> org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35)
> ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
> {code}
> h2. Other IOExceptions
> Reading _pushedSeqId_ from ZooKeeper may also throw an IOException.
> h1. Impact
> This bug breaks serial replication guarantees (entries must be pushed in
> order based on their seqId).
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)