[ https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell resolved HBASE-26120.
-----------------------------------------
Hadoop Flags: Reviewed
Resolution: Fixed
> New replication gets stuck or data loss when multiwal groups more than 10
> -------------------------------------------------------------------------
>
> Key: HBASE-26120
> URL: https://issues.apache.org/jira/browse/HBASE-26120
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Affects Versions: 1.7.1, 2.4.5
> Reporter: Jasee Tao
> Assignee: Duo Zhang
> Priority: Critical
> Fix For: 2.5.0, 1.7.2, 2.3.7, 2.4.5
>
>
> {code:java}
> void preLogRoll(Path newLog) throws IOException {
>   recordLog(newLog);
>   String logName = newLog.getName();
>   String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
>   synchronized (latestPaths) {
>     Iterator<Path> iterator = latestPaths.iterator();
>     while (iterator.hasNext()) {
>       Path path = iterator.next();
>       // BUG: substring match; the prefix "null1" also matches "null11"
>       if (path.getName().contains(logPrefix)) {
>         iterator.remove();
>         break;
>       }
>     }
>     this.latestPaths.add(newLog);
>   }
> }
> {code}
> ReplicationSourceManager uses _latestPaths_ to track the latest WAL of each
> WAL group, and all of these WALs are enqueued for replication when a new
> replication peer is added.
> If we set hbase.wal.regiongrouping.numgroups > 10, say 12, the WAL group
> names will run from _regionserver.null0.timestamp_ to
> _regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_
> to replace the old log of the same group, so when _regionserver.null1.ts_
> arrives, _regionserver.null11.ts_ may be removed instead, and *_latestPaths_
> grows with wrong logs*.
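> A minimal, self-contained demonstration of the collision (the WAL names
> below are made up for illustration):
> {code:java}
> // Sketch: shows why String.contains mismatches WAL groups when a
> // single-digit group prefix is a substring of a two-digit one.
> public class WalPrefixCollision {
>   public static void main(String[] args) {
>     String rolledPrefix = "regionserver.null1";                // group 1 just rolled
>     String trackedName = "regionserver.null11.1628000000000";  // group 11's latest WAL
>
>     // true: group 11's WAL "contains" group 1's prefix, so the wrong
>     // entry can be evicted from latestPaths
>     System.out.println(trackedName.contains(rolledPrefix));
>
>     // An exact comparison of the extracted prefix does not collide
>     String trackedPrefix = trackedName.substring(0, trackedName.lastIndexOf('.'));
>     System.out.println(trackedPrefix.equals(rolledPrefix));    // false
>   }
> }
> {code}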
> Replication then gets partly stuck, as the stale _regionserver.null1.ts_ no
> longer exists on HDFS, and data may not be replicated to the slave, as
> _regionserver.null11.ts_ is not in the replication queue at startup.
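> One way to avoid the collision is to key the tracked WALs on the exact
> group prefix instead of scanning with _String.contains_. A hedged sketch
> (the map-based field and its name are assumptions, not necessarily the
> committed patch):
> {code:java}
> import java.io.IOException;
> import java.util.HashMap;
> import java.util.Map;
> import org.apache.hadoop.fs.Path;
>
> // Inside ReplicationSourceManager (sketch):
> private final Map<String, Path> latestPathsByPrefix = new HashMap<>();
>
> void preLogRoll(Path newLog) throws IOException {
>   recordLog(newLog);
>   // Exact prefix as the map key: "regionserver.null1" and
>   // "regionserver.null11" are distinct keys, so no substring collision.
>   String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(newLog.getName());
>   synchronized (latestPathsByPrefix) {
>     latestPathsByPrefix.put(logPrefix, newLog);
>   }
> }
> {code}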
> Because of
> [ZOOKEEPER-706|https://issues.apache.org/jira/browse/ZOOKEEPER-706], if there
> are too many logs under the zk node _/hbase/replication/rs/regionserver/peer_,
> remove_peer may fail to delete this znode, and other regionservers cannot
> pick up this queue for replication failover.