[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15385120#comment-15385120 ]
Mingliang Liu commented on HDFS-10301:
--------------------------------------

Thanks for the patch, [~redvine]. I'm catching up on all the insightful discussions here and have learned a lot.

1. {{FSImage#isUpgradeFinalized}} is not volatile, and {{nn.getFSImage().isUpgradeFinalized()}} does not hold the read lock in {{NameNodeRpcServer#blockReport()}}. Is this a problem? It's not closely related to this issue, though.

2. {code:title=TestNameNodePrunesMissingStorages.java}
for (Future<IOException> future: futureList) {
  try {
    future.get();
  } catch (Exception e) {
    LOG.error("Processing block report failed due to {}", e);
  }
}
{code}
Do we need to interpret the return value of {{future.get()}}? If the intent is to handle exceptions thrown by the task, we don't need to return them explicitly: {{Callable.call()}} is permitted to throw checked exceptions, and they are propagated back to the calling thread wrapped in an {{ExecutionException}} IIRC (see the sketch after this comment).

3. {code:title=TestNameNodePrunesMissingStorages.java}
DatanodeStorageInfo[] newStorageInfos = dnDescriptor.getStorageInfos();
Assert.assertEquals(storageInfos.length, newStorageInfos.length);
for (int i = 0; i < storageInfos.length; i++) {
  Assert.assertTrue(storageInfos[i] == newStorageInfos[i]);
}
{code}
Do you mean
{code}
Assert.assertArrayEquals(storageInfos, dnDescriptor.getStorageInfos());
{code}

h6. Minor comments:
# We should add javadoc for {{STORAGE_REPORT}}, since its definition in the {{BlockListAsLongs}} abstract class is not that straightforward.
# {{assert (blockList.getNumberOfBlocks() == -1);}} I believe we don't need an assert statement alongside {{Assert.assertEquals()}}?
# Always use slf4j placeholders in the code, as you are doing in the latest patch. Specifically,
{code:title=BlockManager.java}
LOG.debug("Processing RPC with index " + context.getCurRpc()
    + " out of total " + context.getTotalRpcs() + " RPCs in "
    + "processReport 0x" + Long.toHexString(context.getReportId()));
{code}
We MUST use placeholders here to avoid string construction when the log level is INFO or above. Another example: {{LOG.info("Block pool id: " + blockPoolId);}} can be simplified to {{LOG.info("Block pool id: {}", blockPoolId);}}. For exceptions we don't need a placeholder if the exception is the last parameter, so {{LOG.error("Processing block report failed due to {}", e);}} can be {{LOG.error("Processing block report failed due to ", e);}}
# I see unnecessary blank lines in the v11 patch.
# There are unaddressed long-line checkstyle warnings in {{BlockManager}}.
# {code}
if (nn.getFSImage().isUpgradeFinalized()) {
  Set<String> storageIDsInBlockReport = new HashSet<>();
  if (context.getTotalRpcs() == context.getCurRpc() + 1) {
{code}
can be
{code}
if (nn.getFSImage().isUpgradeFinalized() &&
    context.getTotalRpcs() == context.getCurRpc() + 1) {
  Set<String> storageIDsInBlockReport = new HashSet<>();
{code}
# {code:title=BPServiceActor.java}
DatanodeCommand cmd;
if (...) {
  cmd = ...
} else {
  cmd = ...
}
{code}
Let's make {{cmd}} final.
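A minimal, self-contained sketch of the two points above (class name, thread-pool size, and messages are invented for illustration and are not from the patch): a {{Callable}} throws its checked exception directly, the exception surfaces from {{Future.get()}} wrapped in an {{ExecutionException}}, and slf4j parameterized logging defers message construction until the level is enabled.
{code:title=ExceptionPropagationSketch.java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ExceptionPropagationSketch {
  private static final Logger LOG =
      LoggerFactory.getLogger(ExceptionPropagationSketch.class);

  public static void main(String[] args) throws InterruptedException {
    ExecutorService executor = Executors.newFixedThreadPool(2);
    List<Future<Void>> futures = new ArrayList<>();

    for (int i = 0; i < 2; i++) {
      final int reportIndex = i;
      // Callable.call() may throw a checked exception directly; there is
      // no need to return it as the task's result.
      futures.add(executor.submit(new Callable<Void>() {
        @Override
        public Void call() throws IOException {
          if (reportIndex == 1) {
            throw new IOException("simulated block report failure");
          }
          return null;
        }
      }));
    }

    for (Future<Void> future : futures) {
      try {
        future.get();
      } catch (ExecutionException e) {
        // The IOException thrown inside call() arrives here as the cause of
        // the ExecutionException. With slf4j, the throwable goes last with
        // no placeholder, so the stack trace is preserved.
        LOG.error("Processing block report failed", e.getCause());
      }
    }

    // Parameterized logging: the message is only assembled if DEBUG is
    // enabled, avoiding string concatenation on hot paths.
    long reportId = 0x1234L;
    LOG.debug("Processing RPC with index {} out of total {} RPCs in "
        + "processReport 0x{}", 3, 7, Long.toHexString(reportId));

    executor.shutdown();
  }
}
{code}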
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch, HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch, HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
> When the NameNode is busy, a DataNode can time out sending a block report and then sends the block report again. While processing these two reports at the same time, the NameNode can interleave processing of storages from different reports. This screws up the blockReportId field, which makes the NameNode think that some storages are zombie. Replicas from zombie storages are immediately removed, causing missing blocks.
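To illustrate the race in the description above, a hypothetical sketch follows (this is not NameNode code; the class, the per-storage report-id map, and the pruning method are invented stand-ins): when two copies of the same report carry different report IDs and their per-storage processing interleaves, pruning against one report's ID can declare a just-reported storage zombie.
{code:title=ZombieStorageRaceSketch.java}
import java.util.HashMap;
import java.util.Map;

public class ZombieStorageRaceSketch {

  // Stand-in for the per-storage "last block report id" the NameNode tracks.
  static final Map<String, Long> lastReportId = new HashMap<>();

  static void processStorage(String storageId, long reportId) {
    lastReportId.put(storageId, reportId);
  }

  // At the end of a report, any storage not stamped with that report's id
  // is treated as zombie and its replicas would be removed.
  static void pruneZombies(long reportId) {
    lastReportId.forEach((storage, id) -> {
      if (id != reportId) {
        System.out.println("Storage " + storage
            + " falsely declared zombie (stamped with 0x"
            + Long.toHexString(id) + ")");
      }
    });
  }

  public static void main(String[] args) {
    long firstReport = 0x1L;  // original block report
    long retransmit = 0x2L;   // retransmission of the same report

    // Interleaved processing: DS1 is stamped by the retransmission,
    // DS2 by the original report.
    processStorage("DS1", retransmit);
    processStorage("DS2", firstReport);

    // Pruning runs against the retransmitted report's id, so DS2 looks
    // stale even though it was just reported.
    pruneZombies(retransmit);
  }
}
{code}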