[ https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516053#comment-14516053 ]
Colin Patrick McCabe commented on HDFS-7923:
--------------------------------------------

Thanks, [~clamb]. I like this approach. It avoids sending the block report until the NN requests it, so we don't have to throw away a whole block report to achieve backpressure.

{code}
public static final String DFS_NAMENODE_MAX_CONCURRENT_BLOCK_REPORTS_KEY =
    "dfs.namenode.max.concurrent.block.reports";
public static final int DFS_NAMENODE_MAX_CONCURRENT_BLOCK_REPORTS_DEFAULT =
    Integer.MAX_VALUE;
{code}

It seems like this should default to something less than the default number of RPC handler threads, not to MAX_INT. Given that dfs.namenode.handler.count = 10, it seems like this should be no more than 5 or 6, right? The main point here is to avoid having the NN handler threads completely choked with block reports, and that is defeated if the value is MAX_INT. I realize that you probably intended this to be configured, but it seems like we should have a reasonable default that works for most people.

{code}
--- hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeProtocol.proto
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeProtocol.proto
@@ -195,6 +195,7 @@ message HeartbeatRequestProto {
   optional uint64 cacheCapacity = 6 [ default = 0 ];
   optional uint64 cacheUsed = 7 [default = 0 ];
   optional VolumeFailureSummaryProto volumeFailureSummary = 8;
+  optional bool requestSendFullBlockReport = 9;
 }
{code}

Let's have a {{\[default = false\]}} here so that we don't have to add a bunch of clunky {{hasFoo}} checks. Unless there is something we'd like to do differently in the "false" and "not present" cases, but I can't think of what that would be.

{code}
+  /* Number of block reports currently being processed. */
+  private final AtomicInteger blockReportProcessingCount = new AtomicInteger(0);
{code}

I'm not sure an {{AtomicInteger}} makes sense here. We only modify this variable (write to it) when holding the FSN lock in write mode, right?
And we only read from it when holding the FSN in read mode. So there isn't any need for atomic ops.

{code}
+    boolean okToSendFullBlockReport = true;
+    if (requestSendFullBlockReport &&
+        blockManager.getBlockReportProcessingCount() >=
+        maxConcurrentBlockReports) {
+      /* See if we should tell DN to back off for a bit. */
+      final long lastBlockReportTime = blockManager.getDatanodeManager().
+          getDatanode(nodeReg).getLastBlockReportTime();
+      if (lastBlockReportTime > 0) {
+        /* We've received at least one block report. */
+        final long msSinceLastBlockReport = now() - lastBlockReportTime;
+        if (msSinceLastBlockReport < maxBlockReportDeferralMsec) {
+          /* It hasn't been long enough to allow a BR to pass through. */
+          okToSendFullBlockReport = false;
+        }
+      }
+    }
+    return new HeartbeatResponse(cmds, haState, rollingUpgradeInfo,
+        okToSendFullBlockReport);
{code}

There is a TOCTOU (time-of-check, time-of-use) race condition here, right? 1000 datanodes come in and ask me whether it's OK to send an FBR. In each case, I check the number of ongoing FBRs, which is 0, and say "yes." Then 1000 FBRs arrive all at once and the NN melts down.

I think we need to track which datanodes we gave the "green light" to, and not decrement the counter until they either send that report or some timeout expires. (We need the timeout in case datanodes go away after requesting permission to send.) The timeout can probably be as short as a few minutes; if you can't manage to send an FBR in a few minutes, there are bigger problems going on.

{code}
public static final String DFS_BLOCKREPORT_MAX_DEFER_MSEC_KEY =
    "dfs.blockreport.max.deferMsec";
public static final long DFS_BLOCKREPORT_MAX_DEFER_MSEC_DEFAULT =
    Long.MAX_VALUE;
{code}

Do we really need this config key? It seems like we added it because we wanted to avoid starvation (i.e. the case where a given DN never gets the green light). But we are maintaining the last FBR time for each DN anyway.
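To make the "green light" idea concrete, here is a minimal sketch of how that bookkeeping could work: a grant counts against the concurrency limit from the moment it is issued, and is released only when the DN's FBR is processed or the grant times out. All class, method, and variable names here are illustrative, not the patch's actual API.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: grant at most maxConcurrent outstanding FBR
// permissions; a grant is held until release() or until it expires.
class FullBlockReportGate {
  private final int maxConcurrent;
  private final long timeoutMs;
  // DN id -> time the green light was granted; guarded by "this".
  private final Map<String, Long> outstanding = new HashMap<>();

  FullBlockReportGate(int maxConcurrent, long timeoutMs) {
    this.maxConcurrent = maxConcurrent;
    this.timeoutMs = timeoutMs;
  }

  // Called from heartbeat handling when the DN asks permission to send.
  synchronized boolean requestPermission(String dnId, long nowMs) {
    // Expire grants for datanodes that went away without reporting.
    Iterator<Map.Entry<String, Long>> it = outstanding.entrySet().iterator();
    while (it.hasNext()) {
      if (nowMs - it.next().getValue() > timeoutMs) {
        it.remove();
      }
    }
    if (outstanding.containsKey(dnId)) {
      return true;  // this DN was already granted a green light
    }
    if (outstanding.size() >= maxConcurrent) {
      return false; // too many grants outstanding; tell the DN to wait
    }
    outstanding.put(dnId, nowMs);
    return true;
  }

  // Called once the FBR from this DN has been processed.
  synchronized void release(String dnId) {
    outstanding.remove(dnId);
  }
}
```

Because a grant is charged against the limit at check time rather than at FBR-arrival time, the thousand-heartbeats scenario above can hand out at most maxConcurrent green lights, and the timeout bounds how long a vanished DN can pin a slot.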
Surely we can just have a TreeMap or something and tell the datanodes with the oldest {{lastSentTime}} to go. There aren't an infinite number of datanodes in the cluster, so eventually everyone will get the green light. I really would prefer not to have this tunable at all, since I think it's unnecessary. In any case, it's certainly doing us no good as MAX_U64.

> The DataNodes should rate-limit their full block reports by asking the NN on
> heartbeat messages
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Colin Patrick McCabe
>            Assignee: Charles Lamb
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch
>
>
> The DataNodes should rate-limit their full block reports. They can do this
> by first sending a heartbeat message to the NN with an optional boolean set
> which requests permission to send a full block report. If the NN responds
> with another optional boolean set, the DN will send an FBR... if not, it will
> wait until later. This can be done compatibly with optional fields.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)