[ https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859246#comment-16859246 ]
star edited comment on HDFS-12914 at 6/8/19 3:56 PM:
-----------------------------------------------------

[~hexiaoqiao], I also wrote a unit test for this issue, mostly similar to yours. It is pasted here just for reference.

Apart from the test code, one piece of code changed: BlockManager#processReport now throws an IOException to indicate an invalid lease id, so the client receives the exception.

{code:java}
if (context != null) {
  if (!blockReportLeaseManager.checkLease(node, startTime, context.getLeaseId())) {
    throw new IOException("Invalid block report lease id '" + context.getLeaseId() + "'");
  }
}
{code}

{code:java}
// Before test start
conf.setLong(DFSConfigKeys.DFS_NAMENODE_FULL_BLOCK_REPORT_LEASE_LENGTH_MS, 500L);

@Test
public void testDelayedBlockReport() throws IOException {
  FSNamesystem namesystem = cluster.getNameNode(0).getNamesystem();
  BlockManager testBlockManager = Mockito.spy(namesystem.getBlockManager());
  Mockito.doAnswer(new Answer<Boolean>() {
    @Override
    public Boolean answer(InvocationOnMock invocationOnMock) throws Throwable {
      // sleep 1000 ms to delay processing of current report
      Thread.sleep(1000);
      return (Boolean) invocationOnMock.callRealMethod();
    }
  }).when(testBlockManager).processReport(
      Mockito.any(DatanodeID.class),
      Mockito.any(DatanodeStorage.class),
      Mockito.any(BlockListAsLongs.class),
      Mockito.any(BlockReportContext.class));
  namesystem.setBlockManagerForTesting(testBlockManager);

  String bpid = namesystem.getBlockPoolId();
  DataNode dn = cluster.getDataNodes().get(0);
  DatanodeRegistration dnReg = dn.getDNRegistrationForBP(bpid);

  namesystem.readLock();
  long leaseId = testBlockManager.requestBlockReportLeaseId(dnReg);
  namesystem.readUnlock();

  Map<DatanodeStorage, BlockListAsLongs> report = cluster.getBlockReport(bpid, 0);
  List<StorageBlockReport> reportList = new ArrayList<>();
  for (Map.Entry<DatanodeStorage, BlockListAsLongs> en : report.entrySet()) {
    reportList.add(new StorageBlockReport(en.getKey(), en.getValue()));
  }

  // it will throw IOException if lease id is invalid
  cluster.getNameNode().getRpcServer().blockReport(
      dnReg, bpid, reportList.toArray(new StorageBlockReport[]{}),
      new BlockReportContext(1, 0, System.nanoTime(), leaseId, true));
}
{code}


was (Author: starphin):
[~hexiaoqiao], I also write a unit test for this issue, mostly similar to yours. Pasted here just for ref.

Other than the test code, a piece of code changed. BlockManager#processReport will throw IOException to indicate an invalid lease id. Client will get the exception.

{code:java}
if (context != null) {
  if (!blockReportLeaseManager.checkLease(node, startTime, context.getLeaseId())) {
    throw new IOException("Invalid block report lease id '" + context.getLeaseId() + "'");
  }
}
{code}

{code:java}
@Test
public void testDelayedBlockReport() throws IOException {
  FSNamesystem namesystem = cluster.getNameNode(0).getNamesystem();
  BlockManager testBlockManager = Mockito.spy(namesystem.getBlockManager());
  Mockito.doAnswer(new Answer<Boolean>() {
    @Override
    public Boolean answer(InvocationOnMock invocationOnMock) throws Throwable {
      // sleep 1000 ms to delay processing of current report
      Thread.sleep(1000);
      return (Boolean) invocationOnMock.callRealMethod();
    }
  }).when(testBlockManager).processReport(
      Mockito.any(DatanodeID.class),
      Mockito.any(DatanodeStorage.class),
      Mockito.any(BlockListAsLongs.class),
      Mockito.any(BlockReportContext.class));
  namesystem.setBlockManagerForTesting(testBlockManager);

  String bpid = namesystem.getBlockPoolId();
  DataNode dn = cluster.getDataNodes().get(0);
  DatanodeRegistration dnReg = dn.getDNRegistrationForBP(bpid);

  namesystem.readLock();
  long leaseId = testBlockManager.requestBlockReportLeaseId(dnReg);
  namesystem.readUnlock();

  Map<DatanodeStorage, BlockListAsLongs> report = cluster.getBlockReport(bpid, 0);
  List<StorageBlockReport> reportList = new ArrayList<>();
  for (Map.Entry<DatanodeStorage, BlockListAsLongs> en : report.entrySet()) {
    reportList.add(new StorageBlockReport(en.getKey(), en.getValue()));
  }

  // it will throw IOException if lease id is invalid
  cluster.getNameNode().getRpcServer().blockReport(
      dnReg, bpid, reportList.toArray(new StorageBlockReport[]{}),
      new BlockReportContext(1, 0, System.nanoTime(), leaseId, true));
}
{code}

> Block report leases cause missing blocks until next report
> ----------------------------------------------------------
>
>                 Key: HDFS-12914
>                 URL: https://issues.apache.org/jira/browse/HDFS-12914
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.8.0, 2.9.2
>            Reporter: Daryn Sharp
>            Assignee: Santosh Marella
>            Priority: Critical
>         Attachments: HDFS-12914-branch-2.001.patch,
> HDFS-12914-trunk.00.patch, HDFS-12914-trunk.01.patch, HDFS-12914.005.patch,
> HDFS-12914.006.patch
>
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for
> conditions such as "unknown datanode", "not in pending set", "lease has
> expired", wrong lease id, etc. Lease rejection does not throw an exception.
> It returns false, which bubbles up to {{NameNodeRpcServer#blockReport}} and
> is interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected because of an invalid lease becomes
> active with _no blocks_. A replication storm ensues, possibly causing DNs to
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on
> re-registration. The cluster will have many "missing blocks" until the DN's
> next FBR is sent and/or forced.
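
For illustration only, the difference between the reported behaviour (a rejected lease surfaces as a plain {{false}}) and the change proposed in the comment (a rejected lease raises an IOException) can be reduced to a small standalone sketch. Everything in it is hypothetical and is not HDFS code: the class {{LeaseRejectionSketch}}, its methods, and the single-map lease table are simplified stand-ins chosen for the example, and the real {{checkLease}} also takes timing into account.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a lease check; not HDFS code. */
public class LeaseRejectionSketch {
  // Maps a (hypothetical) datanode name to the lease id it currently holds.
  private final Map<String, Long> leases = new HashMap<>();

  public void grantLease(String node, long leaseId) {
    leases.put(node, leaseId);
  }

  /** Returns true only if the node holds exactly this lease id. */
  public boolean checkLease(String node, long leaseId) {
    Long held = leases.get(node);
    return held != null && held == leaseId;
  }

  /** Reported behaviour: a rejected lease is only visible as a false return. */
  public boolean processReportSilently(String node, long leaseId) {
    if (!checkLease(node, leaseId)) {
      // The caller cannot tell a rejection apart from any other "false" result.
      return false;
    }
    return true;
  }

  /** Proposed behaviour: a rejected lease raises an IOException. */
  public boolean processReportStrict(String node, long leaseId) throws IOException {
    if (!checkLease(node, leaseId)) {
      throw new IOException("Invalid block report lease id '" + leaseId + "'");
    }
    return true;
  }

  public static void main(String[] args) {
    LeaseRejectionSketch nn = new LeaseRejectionSketch();
    nn.grantLease("dn1", 42L);

    // Stale lease id 7: the silent variant just reports false.
    System.out.println("silent: " + nn.processReportSilently("dn1", 7L));

    // The strict variant makes the rejection explicit to the caller.
    try {
      nn.processReportStrict("dn1", 7L);
    } catch (IOException e) {
      System.out.println("strict: " + e.getMessage());
    }
  }
}
```

With the strict variant the sender of the report gets an explicit failure it can react to, for example by requesting a new lease and resending, rather than a return value that the caller may interpret as something else.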