Tomas created HBASE-29501:
-----------------------------
Summary: IOException in SerialReplicationChecker.canPush causes
entries to be pushed out of order
Key: HBASE-29501
URL: https://issues.apache.org/jira/browse/HBASE-29501
Project: HBase
Issue Type: Bug
Components: Replication
Affects Versions: 2.6.2
Reporter: Tomas
HBase version: 2.6.2-hadoop3, revision=6b3b36b429cf9a9d74110de79eb3b327b29ebf17
h1. Problem
In several HBase test clusters with serial replication enabled, entries with a
higher sequence ID were observed being pushed before entries with a lower
sequence ID when _SerialReplicationChecker.canPush_ throws an {_}IOException{_}.
The exception is caught in
{_}SerialReplicationSourceWALReader.readWALEntries{_}. If the batch already
contains entries, the handler breaks out of the surrounding for loop. Otherwise
it only sleeps and then falls through, so the code may push the entry and
record its sequence ID in ZooKeeper even though the check never completed:
{code:java}
try {
  if (!checker.canPush(entry, firstCellInEntryBeforeFiltering)) {
    if (batch.getLastWalPosition() > positionBefore) {
      // we have something that can push, break
      break;
    } else {
      checker.waitUntilCanPush(entry, firstCellInEntryBeforeFiltering);
    }
  }
} catch (IOException e) {
  LOG.warn("failed to check whether we can push the WAL entries", e);
  if (batch.getLastWalPosition() > positionBefore) {
    // we have something that can push, break
    break;
  }
  sleepMultiplier = sleep(sleepMultiplier);
}
// <--- continue here after exception is caught
// arrive here means we can push the entry, record the last sequence id
batch.setLastSeqId(Bytes.toString(entry.getKey().getEncodedRegionName()),
  entry.getKey().getSequenceId());
// actually remove the entry.
removeEntryFromStream(entryStream, batch);
if (addEntryToBatch(batch, entry)) {
  break;
}
{code}
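The fall-through can be reproduced in a small self-contained model. This is only a sketch of one possible direction (retry the check instead of continuing), not a patch against HBase. _SerialPushSketch_, _Checker_, _readBuggy_, _readFixed_, _failOnceThenDeny_ and _maxAttempts_ are hypothetical names; the real reader works on WAL entries and batches, sleeps between attempts, and calls _waitUntilCanPush_ when the check returns false, all of which is simplified away here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SerialPushSketch {
  /** Hypothetical stand-in for SerialReplicationChecker. */
  interface Checker {
    boolean canPush(long seqId) throws IOException;
  }

  /** Mirrors the reported shape: an IOException on an empty batch falls through. */
  static List<Long> readBuggy(long[] entries, Checker checker) {
    List<Long> batch = new ArrayList<>();
    for (long seqId : entries) {
      try {
        if (!checker.canPush(seqId)) {
          break; // simplification: the real code may wait here instead
        }
      } catch (IOException e) {
        if (!batch.isEmpty()) {
          break; // ship what we already have
        }
        // empty batch: the real code sleeps, then control falls through
      }
      batch.add(seqId); // reached even though the check never completed
    }
    return batch;
  }

  /** Assumed fix: repeat the check for the same entry until it completes. */
  static List<Long> readFixed(long[] entries, Checker checker, int maxAttempts) {
    List<Long> batch = new ArrayList<>();
    for (long seqId : entries) {
      boolean allowed = false;
      for (int attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          allowed = checker.canPush(seqId);
          break; // the check completed, stop retrying
        } catch (IOException e) {
          if (!batch.isEmpty()) {
            return batch; // ship what we already have
          }
          // empty batch: a real implementation would sleep before retrying
        }
      }
      if (!allowed) {
        break; // denied, or attempts exhausted: never push unchecked
      }
      batch.add(seqId);
    }
    return batch;
  }

  /** Checker that throws once, then reports that the entry must not be pushed. */
  static Checker failOnceThenDeny() {
    boolean[] failed = {false};
    return seqId -> {
      if (!failed[0]) {
        failed[0] = true;
        throw new IOException("connection is closed");
      }
      return false;
    };
  }

  public static void main(String[] args) {
    long[] entries = {5L};
    System.out.println("buggy: " + readBuggy(entries, failOnceThenDeny())); // prints "buggy: [5]"
    System.out.println("fixed: " + readFixed(entries, failOnceThenDeny(), 3)); // prints "fixed: []"
  }
}
```

In this model the buggy loop pushes entry 5 although the checker would have denied it, while the retrying loop pushes nothing. The bounded _maxAttempts_ exists only so the sketch terminates; whether a real fix should retry indefinitely or give up is left open.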
h2. IOException Example 1
The RegionServer is terminating, which causes {_}java.io.IOException: connection
is closed{_} when scanning the meta table for barriers. The RS shutdown may race
with the shipper finishing replication of the entry:
{code:java}
2025-08-01T18:15:46,477 WARN [regionserver/home-host-1:16020.replicationSource.wal-reader.home-host-1%2C16020%2C1754068134363,peer_2] regionserver.SerialReplicationSourceWALReader: failed to check whether we can push the WAL entries
java.io.IOException: connection is closed
    at org.apache.hadoop.hbase.MetaTableAccessor.getMetaHTable(MetaTableAccessor.java:236) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2041) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.waitUntilCanPush(SerialReplicationChecker.java:268) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:89) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
{code}
h2. IOException Example 2
A timeout while reading barriers from the hbase:meta table:
{code:java}
2025-08-06T11:42:10,495 WARN [regionserver/home-host-1:16020.replicationSource,peer_1.replicationSource.wal-reader.home-host-1%2C16020%2C1754475014225,peer_1] regionserver.SerialReplicationSourceWALReader: failed to check whether we can push the WAL entries
java.io.IOException: Failed to get result within timeout, timeout=60000ms
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:250) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:53) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithoutRetries(RpcRetryingCallerImpl.java:206) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:281) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:450) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:324) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:622) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.MetaTableAccessor.getReplicationBarrierResult(MetaTableAccessor.java:2043) ~[hbase-client-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:187) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationChecker.canPush(SerialReplicationChecker.java:262) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.readWALEntries(SerialReplicationSourceWALReader.java:84) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:177) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
    at org.apache.hadoop.hbase.replication.regionserver.SerialReplicationSourceWALReader.run(SerialReplicationSourceWALReader.java:35) ~[hbase-server-2.6.2-hadoop3.jar:2.6.2-hadoop3]
{code}
h2. Other IOExceptions
Reading _pushedSeqId_ from ZooKeeper may also throw an {_}IOException{_}, in
which case it would presumably hit the same catch block.
h1. Impact
This bug breaks the serial replication guarantee that entries are pushed in
order of their sequence ID (seqId).