[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413846#comment-15413846 ]
Daryn Sharp commented on HDFS-10301:
------------------------------------

My main objections (other than the fatal bug) are the incompatible change to the protocol, coupled with what is essentially a malformed block report buffer. It's an attempt to shoehorn into block report processing what should be handled by a heartbeat's storage reports. When you say my compatibility concern was addressed, I take it that it wasn't fixed in code, but rather stated as "don't do that"? Won't the empty storage reports in the last RPC cause an older NN to go into a replication storm? Full downtime on a ~5k-node cluster to roll back, then ~40 minutes to go active, is unacceptable when a failover to the prior release would have worked if not for this patch. This approach will also preclude processing full block reports (FBRs) asynchronously, as I did with incremental block reports (IBRs).

Zombies should be handled by the heartbeat's pruning of excess storages. As an illustration, shouldn't something close to this work?

{code}
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
@@ -466,11 +466,16 @@ public void updateHeartbeatState(StorageReport[] reports, long cacheCapacity,
     setLastUpdateMonotonic(Time.monotonicNow());
     this.volumeFailures = volFailures;
     this.volumeFailureSummary = volumeFailureSummary;
+
+    boolean storagesUpToDate = true;
     for (StorageReport report : reports) {
       DatanodeStorageInfo storage = updateStorage(report.getStorage());
       if (checkFailedStorages) {
         failedStorageInfos.remove(storage);
       }
+      // don't prune unless block reports for all the storages in the
+      // heartbeat have been processed
+      storagesUpToDate &= (storage.getLastBlockReportId() == curBlockReportId);
       storage.receivedHeartbeat(report);
       totalCapacity += report.getCapacity();
@@ -492,7 +497,8 @@ public void updateHeartbeatState(StorageReport[] reports, long cacheCapacity,
     synchronized (storageMap) {
       storageMapSize = storageMap.size();
     }
-    if (storageMapSize != reports.length) {
+    if (curBlockReportId != 0
+        ? storagesUpToDate : storageMapSize != reports.length) {
       pruneStorageMap(reports);
     }
   }
@@ -527,6 +533,7 @@ private void pruneStorageMap(final StorageReport[] reports) {
         // This can occur until all block reports are received.
         LOG.debug("Deferring removal of stale storage {} with {} blocks",
             storageInfo, storageInfo.numBlocks());
+        storageInfo.setState(DatanodeStorage.State.FAILED);
       }
     }
   }
{code}

The next heartbeat after all reports are sent triggers the pruning. Other changes are required, such as removing much of the context-processing code, similar to the current patch.
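To make the heartbeat-gated pruning concrete, here is a minimal, self-contained sketch of the idea. It is illustration only: the class, field, and method names (HeartbeatPruneSketch, processBlockReport, and so on) are simplified stand-ins, not the real DatanodeDescriptor/DatanodeStorageInfo code.

{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of heartbeat-driven storage pruning.
 *  Names only loosely mirror the real HDFS classes. */
public class HeartbeatPruneSketch {

  static final class Storage {
    final String id;
    long lastBlockReportId;  // stamped when this storage's block report is processed
    Storage(String id) { this.id = id; }
  }

  private final Map<String, Storage> storageMap = new HashMap<>();
  private long curBlockReportId;  // nonzero once a full block report cycle starts

  /** Called as each per-storage block report is processed. */
  void processBlockReport(String storageId, long brId) {
    curBlockReportId = brId;
    storageMap.computeIfAbsent(storageId, Storage::new).lastBlockReportId = brId;
  }

  /** Called on every heartbeat; reports list the storages the DN still has. */
  void updateHeartbeatState(String[] reports) {
    boolean storagesUpToDate = true;
    for (String id : reports) {
      Storage s = storageMap.computeIfAbsent(id, Storage::new);
      // don't prune unless block reports for all the storages in the
      // heartbeat have been processed under the current report id
      storagesUpToDate &= (s.lastBlockReportId == curBlockReportId);
    }
    if (curBlockReportId != 0 ? storagesUpToDate
                              : storageMap.size() != reports.length) {
      pruneStorageMap(reports);
    }
  }

  /** Drop storages the DN no longer includes in its heartbeat. */
  private void pruneStorageMap(String[] reports) {
    Set<String> live = new HashSet<>(Arrays.asList(reports));
    storageMap.keySet().removeIf(id -> !live.contains(id));
  }

  public static void main(String[] args) {
    HeartbeatPruneSketch nn = new HeartbeatPruneSketch();
    nn.processBlockReport("s1", 41);  // stale entry from an earlier report
    nn.processBlockReport("s2", 42);
    nn.processBlockReport("s3", 42);
    nn.updateHeartbeatState(new String[] {"s2", "s3"});  // s1 no longer reported
    System.out.println(nn.storageMap.keySet());  // s1 pruned; only s2 and s3 remain
  }
}
{code}

In this sketch the excess storage s1 is pruned only once every storage in the heartbeat has been stamped with the current report id, i.e. on the next heartbeat after all reports are in, which is the intended ordering.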
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>             Fix For: 2.7.4
>
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch,
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch,
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch,
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch,
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
> When the NameNode is busy, a DataNode can time out sending a block report,
> and it then sends the block report again. The NameNode, processing these
> two reports at the same time, can interleave processing of storages from
> the different reports. This corrupts the blockReportId field, which makes
> the NameNode think that some storages are zombies. Replicas from zombie
> storages are immediately removed, causing missing blocks.
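For illustration only, here is a minimal, hypothetical sketch of the race described above. The types and methods (ZombieRaceSketch, processStorageReport, removeZombieStorages) are simplified stand-ins, not the actual NameNode/BlockManager code: two reports for the same DataNode interleave, and the zombie check for the retransmitted report removes a live storage that was just re-stamped by the original report.

{code}
/** Hypothetical, simplified sketch of the interleaving race; stand-in
 *  types, not the actual NameNode/BlockManager code. */
public class ZombieRaceSketch {

  static final class StorageInfo {
    final String id;
    long lastBlockReportId;  // id of the last report that touched this storage
    StorageInfo(String id) { this.id = id; }
  }

  /** Processing one storage's section of a full block report stamps the storage. */
  static void processStorageReport(StorageInfo storage, long blockReportId) {
    storage.lastBlockReportId = blockReportId;
  }

  /** After the last storage of a report, storages not stamped with that
   *  report's id are treated as zombies and their replicas removed. */
  static void removeZombieStorages(StorageInfo[] storages, long blockReportId) {
    for (StorageInfo s : storages) {
      if (s.lastBlockReportId != blockReportId) {
        System.out.println("falsely removing 'zombie' storage " + s.id
            + " (last stamped by report " + s.lastBlockReportId + ")");
      }
    }
  }

  public static void main(String[] args) {
    StorageInfo s1 = new StorageInfo("s1");
    StorageInfo s2 = new StorageInfo("s2");
    StorageInfo[] all = { s1, s2 };

    // The DN times out and retransmits, so report 1 (original) and report 2
    // (retransmission) are in flight; the NN interleaves their storages:
    processStorageReport(s1, 1);  // report 1, storage s1
    processStorageReport(s1, 2);  // report 2, storage s1
    processStorageReport(s2, 2);  // report 2, storage s2 (report 2 finishes)
    processStorageReport(s2, 1);  // report 1, storage s2 arrives late

    // The zombie check for report 2 now sees s2 stamped with id 1 and
    // removes a perfectly live storage.
    removeZombieStorages(all, 2);
  }
}
{code}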