[ https://issues.apache.org/jira/browse/HBASE-26960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

chenglei updated HBASE-26960:
-----------------------------
    Status: Patch Available  (was: Open)

> Another case for unnecessary replication suspending in RegionReplicationSink
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-26960
>                 URL: https://issues.apache.org/jira/browse/HBASE-26960
>             Project: HBase
>          Issue Type: Bug
>          Components: read replicas
>    Affects Versions: 3.0.0-alpha-2
>            Reporter: chenglei
>            Assignee: chenglei
>            Priority: Major
>
> Besides HBASE-26768, there is another case in which replication in
> {{RegionReplicationSink}} would be suspended:
> For {{RegionReplicationSink}}, when there is a replication error,
> {{RegionReplicationSink}} invokes {{MemStoreFlusher#requestFlush}} to request
> a flush, and after receiving the {{FlushAction#START_FLUSH}} or
> {{FlushAction#CANNOT_FLUSH}} flush marker, it resumes the replication.
> But when {{MemStoreFlusher}} flushes, it invokes the following method
> {{HRegion.flushcache}} with {{writeFlushRequestWalMarker}} set to false:
> {code:java}
> public FlushResultImpl flushcache(List<byte[]> families,
>     boolean writeFlushRequestWalMarker, FlushLifeCycleTracker tracker)
>     throws IOException {
> }
> {code}
> When {{writeFlushRequestWalMarker}} is set to false, {{HRegion.flushcache}}
> does not write the {{FlushAction#CANNOT_FLUSH}} flush marker to the {{WAL}}
> when the memstore is empty, as the following
> {{HRegion.writeFlushRequestMarkerToWAL}} illustrates:
> {code:java}
> private boolean writeFlushRequestMarkerToWAL(WAL wal, boolean writeFlushWalMarker) {
>   if (writeFlushWalMarker && wal != null && !writestate.readOnly) {
>     FlushDescriptor desc = ProtobufUtil.toFlushDescriptor(FlushAction.CANNOT_FLUSH,
>       getRegionInfo(), -1, new TreeMap<>(Bytes.BYTES_COMPARATOR));
>     try {
>       WALUtil.writeFlushMarker(wal, this.getReplicationScope(), getRegionInfo(), desc, true,
>         mvcc, regionReplicationSink.orElse(null));
>       return true;
>     } catch (IOException e) {
>       LOG.warn(getRegionInfo().getEncodedName() + " : "
>         + "Received exception while trying to write the flush request to wal", e);
>     }
>   }
>   return false;
> }
> {code}
> So when a replication error occurs while the memstore is empty (e.g. while
> replicating the {{FlushAction#START_FLUSH}} or {{FlushAction#COMMIT_FLUSH}}
> marker), the replication may stay suspended until the next memstore flush,
> even though later user writes arrive that could be replicated normally.
> I simulate this problem in the PR. The {{writeFlushRequestWalMarker}}
> parameter was introduced by HBASE-11580 and only determines whether the
> {{FlushAction#CANNOT_FLUSH}} flush marker is written to the WAL when the
> memstore is empty, so I think that, for simplicity, we could always set it
> to true for {{MemStoreFlusher}}.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
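
The suspend scenario described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HBase code: {{ToySink}}, {{ToyRegion}} and {{laterWriteIsReplicated}} are hypothetical names, and the model keeps only the two facts taken from the description (the sink resumes on a {{START_FLUSH}} or {{CANNOT_FLUSH}} marker, and an empty-memstore flush emits {{CANNOT_FLUSH}} only when {{writeFlushRequestWalMarker}} is true).

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types are real HBase classes.
class FlushMarkerSketch {
    enum FlushAction { START_FLUSH, COMMIT_FLUSH, CANNOT_FLUSH }

    // Hypothetical stand-in for the suspend/resume state of RegionReplicationSink.
    static class ToySink {
        boolean suspended = false;
        final List<String> replicated = new ArrayList<>();

        void onReplicationError() {
            suspended = true; // replication stops until a suitable flush marker arrives
        }

        void onFlushMarker(FlushAction action) {
            // Per the issue text, only START_FLUSH or CANNOT_FLUSH resumes replication.
            if (action == FlushAction.START_FLUSH || action == FlushAction.CANNOT_FLUSH) {
                suspended = false;
            }
        }

        void onUserWrite(String edit) {
            if (!suspended) {
                replicated.add(edit);
            }
        }
    }

    // Hypothetical stand-in for the flush path of a region.
    static class ToyRegion {
        int memstoreSize = 0;
        final ToySink sink;

        ToyRegion(ToySink sink) {
            this.sink = sink;
        }

        void flushcache(boolean writeFlushRequestWalMarker) {
            if (memstoreSize == 0) {
                // Empty memstore: a marker is emitted only if the flag allows it.
                if (writeFlushRequestWalMarker) {
                    sink.onFlushMarker(FlushAction.CANNOT_FLUSH);
                }
                return;
            }
            sink.onFlushMarker(FlushAction.START_FLUSH);
            memstoreSize = 0;
            sink.onFlushMarker(FlushAction.COMMIT_FLUSH);
        }

        void put(String edit) {
            memstoreSize++;
            sink.onUserWrite(edit);
        }
    }

    // Replays the scenario: error on an empty memstore, requested flush, then a user write.
    static boolean laterWriteIsReplicated(boolean writeFlushRequestWalMarker) {
        ToySink sink = new ToySink();
        ToyRegion region = new ToyRegion(sink);
        sink.onReplicationError();
        region.flushcache(writeFlushRequestWalMarker);
        region.put("row1");
        return sink.replicated.contains("row1");
    }

    public static void main(String[] args) {
        System.out.println("marker=false, later write replicated: " + laterWriteIsReplicated(false));
        System.out.println("marker=true,  later write replicated: " + laterWriteIsReplicated(true));
    }
}
```

In this model the later write is lost to the replica when the flag is false and is replicated when the flag is true, which mirrors the reasoning for always passing true from {{MemStoreFlusher}}.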