[ 
https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589157#comment-14589157
 ] 

Colin Patrick McCabe commented on HDFS-7923:
--------------------------------------------

bq. Can a DN's block report be delayed for some significant period of time or, 
due to a subtle bug, even for a long time?

It's important to distinguish between sending block reports and processing 
block reports.  This patch delays sending block reports, but it should not 
delay processing block reports by any significant amount.  The idea is that, in 
general, sending a bunch of block reports that can't be processed until much 
later is bad, for the reasons discussed above (GC problems, lack of RPC 
handler threads, memory consumption, etc.).  But the patch should keep the FBRs 
flowing pretty regularly: we will still queue up 6 of them on the NN even 
though we can only process 1 at a time.
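
As a rough illustration of "queue up 6, process 1", here is a minimal sketch 
of an NN-side gate that hands out a bounded number of FBR permissions.  The 
class and method names are made up for this example and are not taken from the 
patch; the only number that comes from the discussion is the limit of 6.

```java
import java.util.concurrent.Semaphore;

// Hypothetical NN-side gate; an illustration, not the class used by the patch.
final class FbrAdmissionGate {
    private final Semaphore slots;

    FbrAdmissionGate(int maxOutstanding) {
        this.slots = new Semaphore(maxOutstanding);
    }

    // Called while handling a heartbeat that asks for permission to send an FBR.
    // Never blocks: a DN that is refused simply asks again on a later heartbeat.
    boolean tryGrant() {
        return slots.tryAcquire();
    }

    // Called once a granted FBR has been fully processed, freeing its slot.
    void finished() {
        slots.release();
    }
}
```

With a limit of 6, the seventh DN to ask is told to wait, and it gets a slot 
only after one of the outstanding reports has been processed.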

bq. Does your design have a safety net - say a DN will wait a max of 2 periods 
to get permission (or something like that).

This is kind of like a traffic light, right?  If the traffic light is red for a 
long time, there must be a problem somewhere else in the system.  But the 
solution can't be to slam on the accelerator when the red light lasts too long.  
You'll just crash, especially in a traffic jam.

Maybe car analogies are taking it too far, but hopefully you can see what I'm 
saying.  I think sending FBRs when the system is not ready for them is really 
bad behavior.  It leads to congestion collapse, which is much worse than 
starving a few DNs for a while.

Hmm.  What if we had a metric which was the average length of time the DN had to 
wait before sending a full block report that it wanted to send?  Management 
systems could follow this metric and raise an alert when the time got too high.  
Then the admin can do something to solve the problem.
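
A minimal sketch of what such a metric could look like, assuming the DN 
records the time it first wanted to send an FBR and the time it actually sent 
it.  The class name, method names, and the threshold check are hypothetical 
and are not part of any existing Hadoop metrics API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical metric holder; names are illustrative only.
final class FbrWaitMetric {
    private final AtomicLong totalWaitMs = new AtomicLong();
    private final AtomicLong samples = new AtomicLong();

    // wantedAtMs: when the DN first wanted to send the FBR.
    // sentAtMs:   when it was finally allowed to send it.
    void recordWait(long wantedAtMs, long sentAtMs) {
        totalWaitMs.addAndGet(Math.max(0L, sentAtMs - wantedAtMs));
        samples.incrementAndGet();
    }

    // Average wait over all recorded FBRs, or 0 if none were recorded.
    double averageWaitMs() {
        long n = samples.get();
        return n == 0 ? 0.0 : (double) totalWaitMs.get() / n;
    }

    // What a management system might check before raising an alert.
    boolean exceeds(long thresholdMs) {
        return averageWaitMs() > thresholdMs;
    }
}
```

The alert threshold itself would be up to the admin; nothing here picks one.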

> The DataNodes should rate-limit their full block reports by asking the NN on 
> heartbeat messages
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>    Affects Versions: 2.8.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: 2.8.0
>
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch, 
> HDFS-7923.002.patch, HDFS-7923.003.patch, HDFS-7923.004.patch, 
> HDFS-7923.006.patch, HDFS-7923.007.patch
>
>
> The DataNodes should rate-limit their full block reports.  They can do this 
> by first sending a heartbeat message to the NN with an optional boolean set 
> which requests permission to send a full block report.  If the NN responds 
> with another optional boolean set, the DN will send an FBR... if not, it will 
> wait until later.  This can be done compatibly with optional fields.
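
The handshake in the description above can be sketched from the DN side as 
follows.  The types here are plain Java stand-ins for the optional fields 
mentioned in the description; every class, field, and method name is an 
assumption made for the example, not the patch's actual code.

```java
// Hypothetical stand-in for the heartbeat message.
final class Heartbeat {
    // Optional flag; a DN that never sets it is treated as not asking.
    final boolean requestFullBlockReport;

    Heartbeat(boolean requestFullBlockReport) {
        this.requestFullBlockReport = requestFullBlockReport;
    }
}

// Hypothetical stand-in for the heartbeat response.
final class HeartbeatResponse {
    // Optional flag; an NN that never sets it is treated as not granting.
    final boolean fullBlockReportGranted;

    HeartbeatResponse(boolean fullBlockReportGranted) {
        this.fullBlockReportGranted = fullBlockReportGranted;
    }
}

// Hypothetical view of the NN as seen by the DN.
interface NameNodeSketch {
    HeartbeatResponse heartbeat(Heartbeat hb);
}

final class DataNodeSketch {
    private boolean fbrDue = false;
    private int fbrsSent = 0;

    void markFullBlockReportDue() {
        fbrDue = true;
    }

    int fullBlockReportsSent() {
        return fbrsSent;
    }

    // One heartbeat round: ask for permission only if an FBR is due, and
    // send it only if the NN says yes.  Otherwise try again next heartbeat.
    void onHeartbeat(NameNodeSketch nn) {
        HeartbeatResponse resp = nn.heartbeat(new Heartbeat(fbrDue));
        if (fbrDue && resp.fullBlockReportGranted) {
            fbrsSent++; // stand-in for actually sending the FBR
            fbrDue = false;
        }
    }
}
```

A refused DN does nothing special: it keeps the FBR marked as due and repeats 
the request on its next heartbeat.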



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
