[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413846#comment-15413846 ]
Daryn Sharp commented on HDFS-10301:
------------------------------------

My main objections (other than the fatal bug) are the incompatible change to the protocol, coupled with what is essentially a malformed block report buffer. It's an attempt to shoehorn into block report processing what should be handled by a heartbeat's storage reports. When you say my compatibility concern was addressed, I take it that it wasn't fixed in code, but rather stated as "don't do that"? Won't the empty storage reports in the last RPC cause an older NN to go into a replication storm? Full downtime on a ~5k-node cluster to roll back, then ~40 minutes to go active, is unacceptable when a failover to the prior release would have worked if not for this patch. This approach will also preclude processing full block reports (FBRs) asynchronously, as I did with incremental block reports (IBRs).

Zombies should be handled by the heartbeat's pruning of excess storages. As an illustration, shouldn't something close to this work?

{code}
--- a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
+++ b/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeDescriptor.java
@@ -466,11 +466,16 @@ public void updateHeartbeatState(StorageReport[] reports, long cacheCapacity,
     setLastUpdateMonotonic(Time.monotonicNow());
     this.volumeFailures = volFailures;
     this.volumeFailureSummary = volumeFailureSummary;
+
+    boolean storagesUpToDate = true;
     for (StorageReport report : reports) {
       DatanodeStorageInfo storage = updateStorage(report.getStorage());
       if (checkFailedStorages) {
         failedStorageInfos.remove(storage);
       }
+      // don't prune unless block reports for all the storages in the
+      // heartbeat have been processed
+      storagesUpToDate &= (storage.getLastBlockReportId() == curBlockReportId);
       storage.receivedHeartbeat(report);
       totalCapacity += report.getCapacity();
@@ -492,7 +497,8 @@ public void updateHeartbeatState(StorageReport[] reports, long cacheCapacity,
     synchronized (storageMap) {
       storageMapSize = storageMap.size();
     }
-    if (storageMapSize != reports.length) {
+    if (curBlockReportId != 0
+        ? storagesUpToDate : storageMapSize != reports.length) {
       pruneStorageMap(reports);
     }
   }
@@ -527,6 +533,7 @@ private void pruneStorageMap(final StorageReport[] reports) {
         // This can occur until all block reports are received.
         LOG.debug("Deferring removal of stale storage {} with {} blocks",
             storageInfo, storageInfo.numBlocks());
+        storageInfo.setState(DatanodeStorage.State.FAILED);
       }
     }
   }
{code}

The next heartbeat after all reports are sent triggers the pruning. Other changes are required, such as removing much of the context-processing code, similar to the current patch.
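To make the heartbeat-gated pruning concrete, here is a minimal, self-contained sketch of the idea. It is illustration only: the class, field, and method names (HeartbeatPruneSketch, processBlockReport, and so on) are simplified stand-ins, not the real DatanodeDescriptor/DatanodeStorageInfo code.

{code}
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of heartbeat-driven storage pruning.
 *  Names only loosely mirror the real HDFS classes. */
public class HeartbeatPruneSketch {

  static final class Storage {
    final String id;
    long lastBlockReportId;  // stamped when this storage's block report is processed
    Storage(String id) { this.id = id; }
  }

  private final Map<String, Storage> storageMap = new HashMap<>();
  private long curBlockReportId;  // nonzero once a full block report cycle starts

  /** Called as each per-storage block report is processed. */
  void processBlockReport(String storageId, long brId) {
    curBlockReportId = brId;
    storageMap.computeIfAbsent(storageId, Storage::new).lastBlockReportId = brId;
  }

  /** Called on every heartbeat; reports list the storages the DN still has. */
  void updateHeartbeatState(String[] reports) {
    boolean storagesUpToDate = true;
    for (String id : reports) {
      Storage s = storageMap.computeIfAbsent(id, Storage::new);
      // don't prune unless block reports for all the storages in the
      // heartbeat have been processed under the current report id
      storagesUpToDate &= (s.lastBlockReportId == curBlockReportId);
    }
    if (curBlockReportId != 0 ? storagesUpToDate
                              : storageMap.size() != reports.length) {
      pruneStorageMap(reports);
    }
  }

  /** Drop storages the DN no longer includes in its heartbeat. */
  private void pruneStorageMap(String[] reports) {
    Set<String> live = new HashSet<>(Arrays.asList(reports));
    storageMap.keySet().removeIf(id -> !live.contains(id));
  }

  public static void main(String[] args) {
    HeartbeatPruneSketch nn = new HeartbeatPruneSketch();
    nn.processBlockReport("s1", 41);  // stale entry from an earlier report
    nn.processBlockReport("s2", 42);
    nn.processBlockReport("s3", 42);
    nn.updateHeartbeatState(new String[] {"s2", "s3"});  // s1 no longer reported
    System.out.println(nn.storageMap.keySet());  // s1 pruned; only s2 and s3 remain
  }
}
{code}

In this sketch the excess storage s1 is pruned only once every storage in the heartbeat has been stamped with the current report id, i.e. on the next heartbeat after all reports are in, which is the intended ordering.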
> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>             Fix For: 2.7.4
>
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch,
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch,
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch,
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch,
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
> When the NameNode is busy, a DataNode can time out sending a block report,
> and it then sends the block report again. The NameNode, processing these
> two reports at the same time, can interleave processing of storages from
> the different reports. This corrupts the blockReportId field, which makes
> the NameNode think that some storages are zombies. Replicas from zombie
> storages are immediately removed, causing missing blocks.
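For illustration only, here is a minimal, hypothetical sketch of the race described above. The types and methods (ZombieRaceSketch, processStorageReport, removeZombieStorages) are simplified stand-ins, not the actual NameNode/BlockManager code: two reports for the same DataNode interleave, and the zombie check for the retransmitted report removes a live storage that was just re-stamped by the original report.

{code}
/** Hypothetical, simplified sketch of the interleaving race; stand-in
 *  types, not the actual NameNode/BlockManager code. */
public class ZombieRaceSketch {

  static final class StorageInfo {
    final String id;
    long lastBlockReportId;  // id of the last report that touched this storage
    StorageInfo(String id) { this.id = id; }
  }

  /** Processing one storage's section of a full block report stamps the storage. */
  static void processStorageReport(StorageInfo storage, long blockReportId) {
    storage.lastBlockReportId = blockReportId;
  }

  /** After the last storage of a report, storages not stamped with that
   *  report's id are treated as zombies and their replicas removed. */
  static void removeZombieStorages(StorageInfo[] storages, long blockReportId) {
    for (StorageInfo s : storages) {
      if (s.lastBlockReportId != blockReportId) {
        System.out.println("falsely removing 'zombie' storage " + s.id
            + " (last stamped by report " + s.lastBlockReportId + ")");
      }
    }
  }

  public static void main(String[] args) {
    StorageInfo s1 = new StorageInfo("s1");
    StorageInfo s2 = new StorageInfo("s2");
    StorageInfo[] all = { s1, s2 };

    // The DN times out and retransmits, so report 1 (original) and report 2
    // (retransmission) are in flight; the NN interleaves their storages:
    processStorageReport(s1, 1);  // report 1, storage s1
    processStorageReport(s1, 2);  // report 2, storage s1
    processStorageReport(s2, 2);  // report 2, storage s2 (report 2 finishes)
    processStorageReport(s2, 1);  // report 1, storage s2 arrives late

    // The zombie check for report 2 now sees s2 stamped with id 1 and
    // removes a perfectly live storage.
    removeZombieStorages(all, 2);
  }
}
{code}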