[ https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565561#comment-14565561 ]
Colin Patrick McCabe commented on HDFS-7923:
--------------------------------------------

bq. Missing config key documentation in hdfs-defaults.xml

Added.

bq. requestBlockReportLeaseId: empty catch for unregistered node, we could add some more informative logging rather than relying on the warn below

Added.

bq. I discussed the NodeData structure with Colin offline, wondering why we didn't use a standard Collection. Colin brought up the reason of reducing garbage, which seems valid. I think we should consider implementing IntrusiveCollection though rather than writing another.

Yes, there will be quite a few of these requests coming in at any given point. IntrusiveCollection is an interface rather than an implementation, so I don't think it would help here (it's most useful when an element needs to be in multiple lists at once, and when you need fancy operations like finding the list from the element).

bq. I also asked about putting NodeData into DatanodeDescriptor. Not sure what the conclusion was on this, it might reduce garbage since we don't need a separate NodeData object.

The locking is easier to understand if all the lease data is inside {{BlockReportLeaseManager}}.

bq. I prefer Precondition checks for invalid configuration values at startup, so there aren't any surprises for the user. Not everyone reads the messages on startup.

OK.

bq. requestLease has a check for isTraceEnabled, then logs at debug level

Fixed.

bq. In offerService, we ignore the new leaseID if we already have one. On the NN though, a new request wipes out the old leaseID, and processReport checks based on leaseID rather than node. This kind of bug makes me wonder why we really need the leaseID at all, why not just attach a boolean to the node? Or if it's in the deferred vs. pending list?

It's safer for the NameNode to wipe the old lease ID every time there is a new request. It avoids problems where the DN went down while holding a lease, and then came back up.
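As a rough sketch of the lease semantics being discussed here (class and method names are illustrative, not the patch's actual API): the NN hands out a fresh non-zero lease ID on every request, wiping any old one, and the report check accepts a matching ID, with leaseID = 0 deliberately bypassing rate-limiting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Rough sketch of the lease handshake; names are illustrative only.
class LeaseSketch {
    private final Map<String, Long> leases = new HashMap<>(); // datanodeUuid -> current lease ID
    private final Random rand = new Random();

    // NN side: a new request always wipes out any old lease ID, so a DN that
    // went down while holding a lease and came back up simply gets a fresh one.
    long requestLease(String datanodeUuid) {
        long id;
        do {
            id = rand.nextLong();
        } while (id == 0);                // 0 is reserved to mean "no lease / bypass"
        leases.put(datanodeUuid, id);
        return id;
    }

    // NN side: accept an FBR when the presented lease ID matches. leaseID == 0
    // is the "cheating" path that disables rate-limiting: supported, but
    // something we want to be able to see happening.
    boolean checkLease(String datanodeUuid, long leaseId) {
        if (leaseId == 0) {
            return true;
        }
        Long current = leases.get(datanodeUuid);
        return current != null && current == leaseId;
    }
}
```

Matching the granted ID against the presented ID is what lets you correlate the NN-side grant with the DN-side use in logs, which a plain per-node boolean would not give you.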
We could potentially also avoid those problems by being very careful with node (un)registration, but why make things more complicated than they need to be? I do think that the DN should overwrite its old lease ID if the NN gives it a new one, for the same reason. Let me change it to do that... Of course, this code path should never happen, since the NN should never give a new lease ID when none was requested. So calling this a "bug" seems like a bit of a stretch.

I prefer IDs to simply checking against the datanode UUID, because lease IDs allow us to match up the NN granting a lease with the DN accepting and using it, which is very useful for debugging or understanding what is happening in production. It also makes it very obvious whether a DN is "cheating" by sending a block report with leaseID = 0 to disable rate-limiting. This is a use case we want to support, but we also want to know when it is going on.

bq. Can we fix the javadoc for scheduleBlockReport to mention randomness, and not "send...at the next heartbeat?" Incorrect right now.

I looked pretty far back into the history of this code. It seems to go back to at least 2009. The underlying ideas seem to be:
1. The first full block report can have a configurable delay in seconds, expressed by {{dfs.blockreport.initialDelay}}.
2. The second full block report gets a random delay between 0 and {{dfs.blockreport.intervalMsec}}.
3. All other block reports get an interval of {{dfs.blockreport.intervalMsec}} *unless* the previous block report had a longer interval than expected... in that case, the next one gets a shorter interval.

We can keep behavior #1... it's simple to implement and may be useful for testing (although I think this patch makes it no longer necessary). Behavior #2 seems like a workaround for the lack of congestion control in the past. In a world where the NN rate-limits full block reports, we don't need this behavior to prevent FBRs from "clumping".
They will naturally not clump too much, because we are rate-limiting them. Behavior #3 just seems incorrect, even without this patch. By definition, a full block report contains all the information the NN needs to understand the DN state. The fact that block report interval N was longer than expected is no reason to shorten block report interval N+1. In fact, this behavior could lead to congestion collapse... if the NN gets overloaded and can't handle block reports for some time, a bunch of DNs will shorten the time between the current block report and the next one, further increasing total NN load. Not good. Not good at all.

I replaced this with a simple scheme: randomize the first block report time within 0 and {{dfs.blockreport.initialDelay}}, then try to do all other block reports after {{dfs.blockreport.intervalMsec}} ms. If the full block report interval was more than 2x what was configured, we whine about it in the log file (this should only happen if the NN is under extreme load).

bq. Have you thought about moving the BR scheduler to the NN side? We still rely on the DNs to jitter themselves and do the initial delay, but we could have the NN handle all this. This would also let the NN trigger FBRs whenever it wants. We could also do better than random scheduling, i.e. stride it rather than jitter. Incompatible, so we probably won't, but fun to think about.

Yeah, we could do more on the NN to ensure fairness and so forth. I think it's not really a big problem, since datanodes aren't evil, and the existing method of configuring the BR period on the datanode side seems to be working well. We also tend to assume uniform cluster load in HDFS, an assumption which makes complex BR scheduling less interesting. But maybe some day...

bq. Could we do the BRLManager register/unregister in addDatanode and removeDatanode? I think this is safe, since on a DN restart it'll provide a lease ID of 0 and FBR, even without a reg/unreg.

Seems reasonable.
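Going back to the simplified scheduling described above, it boils down to a few lines. This is only a sketch under assumed names (not the actual DataNode code), and it treats {{dfs.blockreport.initialDelay}} as milliseconds for simplicity:

```java
import java.util.Random;

// Sketch of the simplified FBR scheduling; names and units are illustrative.
class BlockReportScheduler {
    private final long initialDelayMs;   // stand-in for dfs.blockreport.initialDelay
    private final long intervalMs;       // stand-in for dfs.blockreport.intervalMsec
    private final Random rand = new Random();
    private long nextReportTimeMs = -1;  // -1 means no report has been scheduled yet

    BlockReportScheduler(long initialDelayMs, long intervalMs) {
        this.initialDelayMs = initialDelayMs;
        this.intervalMs = intervalMs;
    }

    // First report: uniform random delay within [0, initialDelay].
    // All later reports: a fixed interval after the previous one, with no
    // "catch-up" shortening -- that behavior risked congestion collapse.
    void scheduleNext(long nowMs) {
        if (nextReportTimeMs < 0) {
            long jitter = (initialDelayMs > 0)
                ? (long) (rand.nextDouble() * initialDelayMs) : 0;
            nextReportTimeMs = nowMs + jitter;
        } else {
            nextReportTimeMs = nowMs + intervalMs;
        }
    }

    // Whine in the log if the observed gap exceeded 2x the configured interval
    // (should only happen when the NN is under extreme load).
    boolean intervalTooLong(long lastReportMs, long nowMs) {
        return lastReportMs >= 0 && (nowMs - lastReportMs) > 2 * intervalMs;
    }

    long getNextReportTimeMs() { return nextReportTimeMs; }
}
```

Note that a late report simply pushes the next one out by the full interval rather than shortening it, which is the fix for the old behavior #3.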
I moved the BRLM register/unregister calls from registerDatanode into addDatanode / removeDatanode.

> The DataNodes should rate-limit their full block reports by asking the NN on heartbeat messages
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch, HDFS-7923.002.patch, HDFS-7923.003.patch
>
>
> The DataNodes should rate-limit their full block reports. They can do this by first sending a heartbeat message to the NN with an optional boolean set which requests permission to send a full block report. If the NN responds with another optional boolean set, the DN will send an FBR... if not, it will wait until later. This can be done compatibly with optional fields.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)