[ https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17388403#comment-17388403 ]
Hudson commented on HBASE-26120:
--------------------------------

Results for branch branch-2.4 [build #169 on builds.a.o|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/169/]: (x) *{color:red}-1 overall{color}*
----
details (if available):

(/) {color:green}+1 general checks{color}
-- For more information [see general report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/169/General_20Nightly_20Build_20Report/]

(/) {color:green}+1 jdk8 hadoop2 checks{color}
-- For more information [see jdk8 (hadoop2) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/169/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/]

(/) {color:green}+1 jdk8 hadoop3 checks{color}
-- For more information [see jdk8 (hadoop3) report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/169/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/]

(/) {color:green}+1 jdk11 hadoop3 checks{color}
-- For more information [see jdk11 report|https://ci-hadoop.apache.org/job/HBase/job/HBase%20Nightly/job/branch-2.4/169/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/]

(/) {color:green}+1 source release artifact{color}
-- See build output for details.

(/) {color:green}+1 client integration test{color}


> New replication gets stuck or data loss when multiwal groups more than 10
> -------------------------------------------------------------------------
>
>                 Key: HBASE-26120
>                 URL: https://issues.apache.org/jira/browse/HBASE-26120
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.7.1, 2.4.5
>            Reporter: Jasee Tao
>            Assignee: Duo Zhang
>            Priority: Critical
>             Fix For: 2.5.0, 2.3.6, 2.4.5, 1.7.2
>
> {code:java}
> void preLogRoll(Path newLog) throws IOException {
>   recordLog(newLog);
>   String logName = newLog.getName();
>   String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
>   synchronized (latestPaths) {
>     Iterator<Path> iterator = latestPaths.iterator();
>     while (iterator.hasNext()) {
>       Path path = iterator.next();
>       if (path.getName().contains(logPrefix)) {
>         iterator.remove();
>         break;
>       }
>     }
>     this.latestPaths.add(newLog);
>   }
> }
> {code}
> ReplicationSourceManager uses _latestPaths_ to track each WAL group's latest WAL log, and all of these logs are enqueued for replication when a new replication peer is added.
> If we set hbase.wal.regiongrouping.numgroups > 10, say 12, the WAL group log names range from _regionserver.null0.timestamp_ to _regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_ to replace old logs of the same group, so when _regionserver.null1.ts_ arrives, _regionserver.null11.ts_ may be removed instead, and *_latestPaths_ ends up holding the wrong logs*.
> Replication then gets partly stuck because _regionserver.null1.ts_ does not exist on hdfs, and data may not be replicated to the slave cluster because _regionserver.null11.ts_ is not in the replication queue at startup.
> Because of [ZOOKEEPER-706|https://issues.apache.org/jira/browse/ZOOKEEPER-706], if there are too many logs under the zk znode _/hbase/replication/rs/regionserver/peer_, remove_peer may fail to delete this znode, and other regionservers cannot pick up this queue for replication failover.
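For illustration only, below is a minimal, self-contained sketch (not the committed patch) of how comparing the extracted WAL-group prefixes for equality, instead of calling _String.contains_, avoids the null1/null11 collision. The _walPrefix_ helper is a hypothetical stand-in for _DefaultWALProvider.getWALPrefixFromWALName_, and plain strings are used in place of _Path_:

{code:java}
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class LatestPathsSketch {
  private final Set<String> latestPaths = new HashSet<>();

  // Hypothetical helper: strip the trailing ".<timestamp>" to get the WAL group prefix.
  // Real HBase code derives this via DefaultWALProvider.getWALPrefixFromWALName.
  private static String walPrefix(String logName) {
    int idx = logName.lastIndexOf('.');
    return idx < 0 ? logName : logName.substring(0, idx);
  }

  void preLogRoll(String newLog) {
    String newPrefix = walPrefix(newLog);
    synchronized (latestPaths) {
      Iterator<String> it = latestPaths.iterator();
      while (it.hasNext()) {
        // Exact match on the group prefix, not a substring check, so
        // "regionserver.null1" no longer matches "regionserver.null11.<ts>".
        if (walPrefix(it.next()).equals(newPrefix)) {
          it.remove();
          break;
        }
      }
      latestPaths.add(newLog);
    }
  }

  public static void main(String[] args) {
    LatestPathsSketch m = new LatestPathsSketch();
    m.preLogRoll("regionserver.null11.1627000000000");
    m.preLogRoll("regionserver.null1.1627000000001");
    // Both groups survive; a contains-based check would have dropped null11.
    System.out.println(m.latestPaths);
  }
}
{code}

Running the main method keeps one entry per WAL group in _latestPaths_, whereas the original contains-based check would have evicted the _regionserver.null11_ entry when the _regionserver.null1_ log rolled.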