[ https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387673#comment-17387673 ]
Duo Zhang commented on HBASE-26120:
-----------------------------------

Let me provide a fix.

> New replication gets stuck or data loss when multiwal groups more than 10
> -------------------------------------------------------------------------
>
>                 Key: HBASE-26120
>                 URL: https://issues.apache.org/jira/browse/HBASE-26120
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.7.1, 2.4.5
>            Reporter: Jasee Tao
>            Assignee: Duo Zhang
>            Priority: Critical
>
> {code:java}
> void preLogRoll(Path newLog) throws IOException {
>   recordLog(newLog);
>   String logName = newLog.getName();
>   String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
>   synchronized (latestPaths) {
>     Iterator<Path> iterator = latestPaths.iterator();
>     while (iterator.hasNext()) {
>       Path path = iterator.next();
>       // Substring match: the prefix "...null1" also matches "...null11.ts".
>       if (path.getName().contains(logPrefix)) {
>         iterator.remove();
>         break;
>       }
>     }
>     this.latestPaths.add(newLog);
>   }
> }
> {code}
> ReplicationSourceManager uses _latestPaths_ to track the latest WAL of each
> WAL group, and all of them are enqueued for replication when a new
> replication peer is added.
> If hbase.wal.regiongrouping.numgroups is set above 10, say to 12, the WAL
> names range from _regionserver.null0.timestamp_ to
> _regionserver.null11.timestamp_. _preLogRoll_ uses *_String.contains_* to
> find and replace the old log of the same group, so when
> _regionserver.null1.ts_ arrives, _regionserver.null11.ts_ may be removed
> instead, and *_latestPaths_ keeps growing with wrong logs*.
> Replication then gets partly stuck because the stale _regionserver.null1.ts_
> no longer exists on HDFS, and data may not be replicated to the slave
> cluster because _regionserver.null11.ts_ is not in the replication queue at
> startup.
> Because of
> [ZOOKEEPER-706|https://issues.apache.org/jira/browse/ZOOKEEPER-706], if
> there are too many logs under the ZooKeeper znode
> _/hbase/replication/rs/regionserver/peer_, remove_peer may fail to delete
> this znode, and other region servers cannot pick up this queue for
> replication failover.
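
The prefix collision described above can be reproduced outside HBase with plain strings. The following is a minimal sketch, not the fix applied for this issue: the class WalGroupMatch, the helper prefixOf, and the shortened names such as rs.null1.100 are hypothetical, and it assumes a WAL name has the form <prefix>.<timestamp>. The list order is chosen so that the wrong entry is visited first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class WalGroupMatch {
    // Hypothetical helper: assumes WAL names look like "<prefix>.<timestamp>".
    static String prefixOf(String walName) {
        int lastDot = walName.lastIndexOf('.');
        return lastDot < 0 ? walName : walName.substring(0, lastDot);
    }

    // Mirrors the quoted logic: drop the first entry whose name contains the prefix.
    static void rollWithContains(List<String> latest, String newLog) {
        String prefix = prefixOf(newLog);
        Iterator<String> it = latest.iterator();
        while (it.hasNext()) {
            if (it.next().contains(prefix)) {
                it.remove();
                break;
            }
        }
        latest.add(newLog);
    }

    // One possible alternative: compare whole prefixes instead of substrings.
    static void rollWithExactPrefix(List<String> latest, String newLog) {
        String prefix = prefixOf(newLog);
        Iterator<String> it = latest.iterator();
        while (it.hasNext()) {
            if (prefixOf(it.next()).equals(prefix)) {
                it.remove();
                break;
            }
        }
        latest.add(newLog);
    }

    public static void main(String[] args) {
        List<String> buggy = new ArrayList<>(Arrays.asList("rs.null11.100", "rs.null1.100"));
        rollWithContains(buggy, "rs.null1.200");
        // Group 11 is dropped and the stale group 1 entry survives.
        System.out.println(buggy); // [rs.null1.100, rs.null1.200]

        List<String> exact = new ArrayList<>(Arrays.asList("rs.null11.100", "rs.null1.100"));
        rollWithExactPrefix(exact, "rs.null1.200");
        // Only the group 1 entry is replaced.
        System.out.println(exact); // [rs.null11.100, rs.null1.200]
    }
}
```

With the substring check, rolling the log of group 1 evicts the entry of group 11 and leaves the old group 1 log behind, which matches the symptoms in the description. Comparing complete prefixes keeps exactly one entry per group.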
--
This message was sent by Atlassian Jira
(v8.3.4#803005)