[ https://issues.apache.org/jira/browse/HBASE-26482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447819#comment-17447819 ]
Duo Zhang commented on HBASE-26482: ----------------------------------- I think this is definately a problem... > HMaster may clean wals that is replicating in rare cases > -------------------------------------------------------- > > Key: HBASE-26482 > URL: https://issues.apache.org/jira/browse/HBASE-26482 > Project: HBase > Issue Type: Bug > Components: Replication > Reporter: zhuobin zheng > Priority: Critical > > In our cluster, i can found some FileNotFoundException when > ReplicationSourceWALReader running for replication recovery queue. > I guss the wal most likely removed by hmaste. And i found something to > support it. > The method getAllWALs: > [https://github.com/apache/hbase/blob/master/hbase-replication/src/main/java/org/apache/hadoop/hbase/replication/ZKReplicationQueueStorage.java#L509 > > |https://github.com/apache/hbase/blob/master/hbase-replication/src/main/java/org/apache/hadoop/hbase/replication/ZKReplicationQueueStorage.java#L509]Use > zk cversion of /hbase/replication/rs as an optimistic lock to control > concurrent ops. > But, zk cversion *only can only reflect the changes of child nodes, but not > the changes of grandchildren.* > So, HMaster may loss some wal from this method in follow situation. > # HMaster do log clean , and invoke getAllWALs to filter log which should > not be deleted. > # HMaster cache current cversion of /hbase/replication/rs as *v0* > # HMaster cache all RS server name, and traverse them, get the WAL in each > Queue > # *RS2* dead after HMaster traverse {*}RS1{*}, and before traverse *RS2* > # *RS1* claim one queue of *RS2,* which named *peerid-RS2* now > # By the way , the cversion of /hbase/replication/rs not changed before all > of *RS2* queue is removed, because the children of /hbase/replication/rs not > change. > # So, Hmaster will lost the wals in *peerid-RS2,* because we have already > traversed *RS1 ,* and ** this queue not exists in *RS2* > The above expression is currently only speculation, not confirmed > Flie Not Found Log. > > {code:java} > // code placeholder > 2021-11-22 15:18:39,593 ERROR > [ReplicationExecutor-0.replicationSource,peer_id-hostname,60020,1636802867348.replicationSource.wal-reader.hostname%2C60020%2C1636802867348,peer_id-hostname,60020,1636802867348] > regionserver.WALEntryStream: Couldn't locate log: > hdfs://namenode/hbase/oldWALs/hostname%2C60020%2C1636802867348.1636944748704 > 2021-11-22 15:18:39,593 ERROR > [ReplicationExecutor-0.replicationSource,peer_id-hostname,60020,1636802867348.replicationSource.wal-reader.hostname%2C60020%2C1636802867348,peer_id-hostname,60020,1636802867348] > regionserver.ReplicationSourceWALReader: Failed to read stream of > replication entries > java.io.FileNotFoundException: File does not exist: > hdfs://namenode/hbase/oldWALs/hostname%2C60020%2C1636802867348.1636944748704 > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1612) > at > org.apache.hadoop.hdfs.DistributedFileSystem$29.doCall(DistributedFileSystem.java:1605) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1620) > at > org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:64) > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.init(ProtobufLogReader.java:168) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:321) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:303) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:291) > at > org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:427) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openReader(WALEntryStream.java:355) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.openNextLog(WALEntryStream.java:303) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.checkReader(WALEntryStream.java:294) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:175) > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:101) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.readWALEntries(ReplicationSourceWALReader.java:192) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALReader.run(ReplicationSourceWALReader.java:138) > {code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001)