[ https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516053#comment-14516053 ]
Colin Patrick McCabe commented on HDFS-7923:
--------------------------------------------

Thanks, [~clamb]. I like this approach. It avoids sending the block report until the NN requests it, so we don't have to throw away a whole block report to achieve backpressure.

{code}
public static final String DFS_NAMENODE_MAX_CONCURRENT_BLOCK_REPORTS_KEY =
    "dfs.namenode.max.concurrent.block.reports";
public static final int DFS_NAMENODE_MAX_CONCURRENT_BLOCK_REPORTS_DEFAULT =
    Integer.MAX_VALUE;
{code}

It seems like this should default to something less than the default number of RPC handler threads, not to MAX_INT. Given that dfs.namenode.handler.count = 10, it seems like this should be no more than 5 or 6, right? The main point here is to avoid having the NN handler threads completely choked with block reports, and that is defeated if the value is MAX_INT. I realize that you probably intended this to be configured, but it seems like we should have a reasonable default that works for most people.

{code}
--- hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeProtocol.proto
+++ hadoop-hdfs-project/hadoop-hdfs/src/main/proto/DatanodeProtocol.proto
@@ -195,6 +195,7 @@ message HeartbeatRequestProto {
   optional uint64 cacheCapacity = 6 [ default = 0 ];
   optional uint64 cacheUsed = 7 [default = 0 ];
   optional VolumeFailureSummaryProto volumeFailureSummary = 8;
+  optional bool requestSendFullBlockReport = 9;
 }
{code}

Let's have a {{\[default = false\]}} here so that we don't have to add a bunch of clunky {{hasFoo}} checks. Unless there is something we'd like to do differently in the "false" and "not present" cases, but I can't think of what that would be.

{code}
+  /* Number of block reports currently being processed. */
+  private final AtomicInteger blockReportProcessingCount = new AtomicInteger(0);
{code}

I'm not sure an {{AtomicInteger}} makes sense here. We only modify this variable (write to it) when holding the FSN lock in write mode, right?
And we only read from it when holding the FSN in read mode. So there isn't any need for atomic ops.

{code}
+    boolean okToSendFullBlockReport = true;
+    if (requestSendFullBlockReport &&
+        blockManager.getBlockReportProcessingCount() >=
+        maxConcurrentBlockReports) {
+      /* See if we should tell DN to back off for a bit. */
+      final long lastBlockReportTime = blockManager.getDatanodeManager().
+          getDatanode(nodeReg).getLastBlockReportTime();
+      if (lastBlockReportTime > 0) {
+        /* We've received at least one block report. */
+        final long msSinceLastBlockReport = now() - lastBlockReportTime;
+        if (msSinceLastBlockReport < maxBlockReportDeferralMsec) {
+          /* It hasn't been long enough to allow a BR to pass through. */
+          okToSendFullBlockReport = false;
+        }
+      }
+    }
+    return new HeartbeatResponse(cmds, haState, rollingUpgradeInfo,
+        okToSendFullBlockReport);
{code}

There is a TOCTOU (time-of-check, time-of-use) race condition here, right? 1000 datanodes come in and ask me whether it's OK to send an FBR. In each case, I check the number of ongoing FBRs, which is 0, and say "yes." Then 1000 FBRs arrive all at once and the NN melts down.

I think we need to track which datanodes we gave the "green light" to, and not decrement the counter until they either send that report or some timeout expires. (We need the timeout in case datanodes go away after requesting permission to send.) The timeout can probably be as short as a few minutes; if you can't manage to send an FBR in a few minutes, there are bigger problems going on.

{code}
public static final String DFS_BLOCKREPORT_MAX_DEFER_MSEC_KEY =
    "dfs.blockreport.max.deferMsec";
public static final long DFS_BLOCKREPORT_MAX_DEFER_MSEC_DEFAULT =
    Long.MAX_VALUE;
{code}

Do we really need this config key? It seems like we added it because we wanted to avoid starvation (i.e. the case where a given DN never gets the green light). But we are maintaining the last FBR time for each DN anyway.
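To make the "green light" idea concrete, here is a minimal sketch of how that bookkeeping could work: a grant counts against the concurrency limit from the moment it is issued, and is released only when the DN's FBR is processed or the grant times out. All class, method, and variable names here are illustrative, not the patch's actual API.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: grant at most maxConcurrent outstanding FBR
// permissions; a grant is held until release() or until it expires.
class FullBlockReportGate {
  private final int maxConcurrent;
  private final long timeoutMs;
  // DN id -> time the green light was granted; guarded by "this".
  private final Map<String, Long> outstanding = new HashMap<>();

  FullBlockReportGate(int maxConcurrent, long timeoutMs) {
    this.maxConcurrent = maxConcurrent;
    this.timeoutMs = timeoutMs;
  }

  // Called from heartbeat handling when the DN asks permission to send.
  synchronized boolean requestPermission(String dnId, long nowMs) {
    // Expire grants for datanodes that went away without reporting.
    Iterator<Map.Entry<String, Long>> it = outstanding.entrySet().iterator();
    while (it.hasNext()) {
      if (nowMs - it.next().getValue() > timeoutMs) {
        it.remove();
      }
    }
    if (outstanding.containsKey(dnId)) {
      return true;  // this DN was already granted a green light
    }
    if (outstanding.size() >= maxConcurrent) {
      return false; // too many grants outstanding; tell the DN to wait
    }
    outstanding.put(dnId, nowMs);
    return true;
  }

  // Called once the FBR from this DN has been processed.
  synchronized void release(String dnId) {
    outstanding.remove(dnId);
  }
}
```

Because a grant is charged against the limit at check time rather than at FBR-arrival time, the thousand-heartbeats scenario above can hand out at most maxConcurrent green lights, and the timeout bounds how long a vanished DN can pin a slot.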
Surely we can just have a TreeMap or something and tell the datanodes with the oldest {{lastSentTime}} to go. There aren't an infinite number of datanodes in the cluster, so eventually everyone will get the green light. I really would prefer not to have this tunable at all, since I think it's unnecessary. In any case, it's certainly doing us no good as MAX_U64.

> The DataNodes should rate-limit their full block reports by asking the NN on
> heartbeat messages
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Colin Patrick McCabe
>            Assignee: Charles Lamb
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch
>
>
> The DataNodes should rate-limit their full block reports. They can do this
> by first sending a heartbeat message to the NN with an optional boolean set
> which requests permission to send a full block report. If the NN responds
> with another optional boolean set, the DN will send an FBR... if not, it will
> wait until later. This can be done compatibly with optional fields.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)