[jira] [Commented] (HBASE-26120) New replication gets stuck or data loss when multiwal groups more than 10

Duo Zhang (Jira) Mon, 26 Jul 2021 18:42:04 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-26120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17387673#comment-17387673
 ]


Duo Zhang commented on HBASE-26120:
-----------------------------------

Let me provide a fix.

> New replication gets stuck or data loss when multiwal groups more than 10
> -------------------------------------------------------------------------
>
>                 Key: HBASE-26120
>                 URL: https://issues.apache.org/jira/browse/HBASE-26120
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 1.7.1, 2.4.5
>            Reporter: Jasee Tao
>            Assignee: Duo Zhang
>            Priority: Critical
>
> {code:java}
> void preLogRoll(Path newLog) throws IOException {
>   recordLog(newLog);
>   String logName = newLog.getName();
>   String logPrefix = DefaultWALProvider.getWALPrefixFromWALName(logName);
>   synchronized (latestPaths) {
>     Iterator<Path> iterator = latestPaths.iterator();
>     while (iterator.hasNext()) {
>       Path path = iterator.next();
>       if (path.getName().contains(logPrefix)) {
>         iterator.remove();
>         break;
>       }
>     }
>     this.latestPaths.add(newLog);
>   }
> }
> {code}
> ReplicationSourceManager uses _latestPaths_ to track the latest WAL of each 
> WAL group, and all of them are enqueued for replication when a new 
> replication peer is added.
> If we set hbase.wal.regiongrouping.numgroups > 10, say 12, the WAL group 
> names range from _regionserver.null0.timestamp_ to 
> _regionserver.null11.timestamp_. *_String.contains_* is used in _preLogRoll_ 
> to replace the old log of the same group, so when _regionserver.null1.ts_ 
> arrives, _regionserver.null11.ts_ may be replaced instead, and 
> *_latestPaths_ keeps growing with wrong logs*.
> Replication then gets partly stuck because _regionserver.null1.ts_ does not 
> exist on HDFS, and data may not be replicated to the slave because 
> _regionserver.null11.ts_ is not in the replication queue at startup.
> Because of 
> [ZOOKEEPER-706|https://issues.apache.org/jira/browse/ZOOKEEPER-706], if there 
> are too many logs under the ZooKeeper znode 
> _/hbase/replication/rs/regionserver/peer_, remove_peer may fail to delete 
> this znode, and other region servers cannot pick up this queue for 
> replication failover.
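
The prefix collision described above can be reproduced in isolation. The following is only a minimal sketch, not the actual patch for this issue: it uses plain strings instead of Path, and prefixOf is a hypothetical stand-in for DefaultWALProvider.getWALPrefixFromWALName that assumes the name has the form prefix.timestamp. It contrasts the substring match with an exact comparison of the group prefix, which is one possible way to avoid the collision.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class WalGroupMatchSketch {

    // Hypothetical helper (assumption): the group prefix is everything
    // before the last '.', i.e. the name looks like "<prefix>.<timestamp>".
    static String prefixOf(String walName) {
        int idx = walName.lastIndexOf('.');
        return idx < 0 ? walName : walName.substring(0, idx);
    }

    // Mirrors the matching in the quoted preLogRoll: substring test.
    static void rollWithContains(List<String> latest, String newLog) {
        String prefix = prefixOf(newLog);
        Iterator<String> it = latest.iterator();
        while (it.hasNext()) {
            if (it.next().contains(prefix)) {
                it.remove();
                break;
            }
        }
        latest.add(newLog);
    }

    // One possible alternative (assumption): compare group prefixes exactly.
    static void rollWithExactPrefix(List<String> latest, String newLog) {
        String prefix = prefixOf(newLog);
        Iterator<String> it = latest.iterator();
        while (it.hasNext()) {
            if (prefixOf(it.next()).equals(prefix)) {
                it.remove();
                break;
            }
        }
        latest.add(newLog);
    }

    public static void main(String[] args) {
        List<String> buggy = new ArrayList<>(
            Arrays.asList("regionserver.null11.100", "regionserver.null1.100"));
        rollWithContains(buggy, "regionserver.null1.200");
        // The null11 entry is evicted and the stale null1 entry survives.
        System.out.println(buggy);

        List<String> exact = new ArrayList<>(
            Arrays.asList("regionserver.null11.100", "regionserver.null1.100"));
        rollWithExactPrefix(exact, "regionserver.null1.200");
        // Only the entry of the same group is replaced.
        System.out.println(exact);
    }
}
```

With the substring test, "regionserver.null11.100" contains "regionserver.null1", so the wrong group's entry is removed and the list ends up with two null1 logs and no null11 log. With the exact comparison, group 11 keeps its entry.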



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
