[ https://issues.apache.org/jira/browse/HDFS-7923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14565561#comment-14565561 ]
Colin Patrick McCabe commented on HDFS-7923:
--------------------------------------------

bq. Missing config key documentation in hdfs-defaults.xml

Added.

bq. requestBlockReportLeaseId: empty catch for unregistered node, we could add some more informative logging rather than relying on the warn below

Added.

bq. I discussed the NodeData structure with Colin offline, wondering why we didn't use a standard Collection. Colin brought up the reason of reducing garbage, which seems valid. I think we should consider implementing IntrusiveCollection though rather than writing another.

Yes, there will be quite a few of these requests coming in at any given point. IntrusiveCollection is an interface rather than an implementation, so I don't think it would help here (it's most useful when an element needs to be in multiple lists at once, and when you need fancy operations like finding the list from the element).

bq. I also asked about putting NodeData into DatanodeDescriptor. Not sure what the conclusion was on this, it might reduce garbage since we don't need a separate NodeData object.

The locking is easier to understand if all the lease data is inside {{BlockReportLeaseManager}}.

bq. I prefer Precondition checks for invalid configuration values at startup, so there aren't any surprises for the user. Not everyone reads the messages on startup.

OK.

bq. requestLease has a check for isTraceEnabled, then logs at debug level

Fixed.

bq. In offerService, we ignore the new leaseID if we already have one. On the NN though, a new request wipes out the old leaseID, and processReport checks based on leaseID rather than node. This kind of bug makes me wonder why we really need the leaseID at all, why not just attach a boolean to the node? Or if it's in the deferred vs. pending list?

It's safer for the NameNode to wipe the old lease ID every time there is a new request. It avoids problems where the DN went down while holding a lease, and then came back up.
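As a rough sketch of the lease semantics being discussed here (class and method names are illustrative, not the patch's actual API): the NN hands out a fresh non-zero lease ID on every request, wiping any old one, and the report check accepts a matching ID, with leaseID = 0 deliberately bypassing rate-limiting.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Rough sketch of the lease handshake; names are illustrative only.
class LeaseSketch {
    private final Map<String, Long> leases = new HashMap<>(); // datanodeUuid -> current lease ID
    private final Random rand = new Random();

    // NN side: a new request always wipes out any old lease ID, so a DN that
    // went down while holding a lease and came back up simply gets a fresh one.
    long requestLease(String datanodeUuid) {
        long id;
        do {
            id = rand.nextLong();
        } while (id == 0);                // 0 is reserved to mean "no lease / bypass"
        leases.put(datanodeUuid, id);
        return id;
    }

    // NN side: accept an FBR when the presented lease ID matches. leaseID == 0
    // is the "cheating" path that disables rate-limiting: supported, but
    // something we want to be able to see happening.
    boolean checkLease(String datanodeUuid, long leaseId) {
        if (leaseId == 0) {
            return true;
        }
        Long current = leases.get(datanodeUuid);
        return current != null && current == leaseId;
    }
}
```

Matching the granted ID against the presented ID is what lets you correlate the NN-side grant with the DN-side use in logs, which a plain per-node boolean would not give you.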
We could potentially also avoid those problems by being very careful with node (un)registration, but why make things more complicated than they need to be? I do think that the DN should overwrite its old lease ID if the NN gives it a new one, for the same reason. Let me change it to do that... Of course, this code path should never happen, since the NN should never give a new lease ID when none was requested. So calling this a "bug" seems like a bit of a stretch.

I prefer IDs to simply checking against the datanode UUID, because lease IDs allow us to match up the NN granting a lease with the DN accepting and using it, which is very useful for debugging or understanding what is happening in production. It also makes it very obvious whether a DN is "cheating" by sending a block report with leaseID = 0 to disable rate-limiting. This is a use case we want to support, but we also want to know when it is going on.

bq. Can we fix the javadoc for scheduleBlockReport to mention randomness, and not "send...at the next heartbeat?" Incorrect right now.

I looked pretty far back into the history of this code. It seems to go back to at least 2009. The underlying ideas seem to be:
1. The first full block report can have a configurable delay in seconds, expressed by {{dfs.blockreport.initialDelay}}.
2. The second full block report gets a random delay between 0 and {{dfs.blockreport.intervalMsec}}.
3. All other block reports get an interval of {{dfs.blockreport.intervalMsec}} *unless* the previous block report had a longer interval than expected... in that case, the next one gets a shorter interval.

We can keep behavior #1... it's simple to implement and may be useful for testing (although I think this patch makes it no longer necessary). Behavior #2 seems like a workaround for the lack of congestion control in the past. In a world where the NN rate-limits full block reports, we don't need this behavior to prevent FBRs from "clumping".
They will naturally not clump too much, because we are rate-limiting them. Behavior #3 just seems incorrect, even without this patch. By definition, a full block report contains all the information the NN needs to understand the DN state. The fact that block report interval N was longer than expected is no reason to shorten block report interval N+1. In fact, this behavior could lead to congestion collapse... if the NN gets overloaded and can't handle block reports for some time, a bunch of DNs will shorten the time between the current block report and the next one, further increasing total NN load. Not good. Not good at all.

I replaced this with a simple scheme: randomize the first block report time within 0 and {{dfs.blockreport.initialDelay}}, then try to do all other block reports after {{dfs.blockreport.intervalMsec}} ms. If the full block report interval was more than 2x what was configured, we whine about it in the log file (this should only happen if the NN is under extreme load).

bq. Have you thought about moving the BR scheduler to the NN side? We still rely on the DNs to jitter themselves and do the initial delay, but we could have the NN handle all this. This would also let the NN trigger FBRs whenever it wants. We could also do better than random scheduling, i.e. stride it rather than jitter. Incompatible, so we probably won't, but fun to think about.

Yeah, we could do more on the NN to ensure fairness and so forth. I think it's not really a big problem, since datanodes aren't evil, and the existing method of configuring the BR period on the datanode side seems to be working well. We also tend to assume uniform cluster load in HDFS, an assumption which makes complex BR scheduling less interesting. But maybe some day...

bq. Could we do the BRLManager register/unregister in addDatanode and removeDatanode? I think this is safe, since on a DN restart it'll provide a lease ID of 0 and FBR, even without a reg/unreg.

Seems reasonable.
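Going back to the simplified scheduling described above, it boils down to a few lines. This is only a sketch under assumed names (not the actual DataNode code), and it treats {{dfs.blockreport.initialDelay}} as milliseconds for simplicity:

```java
import java.util.Random;

// Sketch of the simplified FBR scheduling; names and units are illustrative.
class BlockReportScheduler {
    private final long initialDelayMs;   // stand-in for dfs.blockreport.initialDelay
    private final long intervalMs;       // stand-in for dfs.blockreport.intervalMsec
    private final Random rand = new Random();
    private long nextReportTimeMs = -1;  // -1 means no report has been scheduled yet

    BlockReportScheduler(long initialDelayMs, long intervalMs) {
        this.initialDelayMs = initialDelayMs;
        this.intervalMs = intervalMs;
    }

    // First report: uniform random delay within [0, initialDelay].
    // All later reports: a fixed interval after the previous one, with no
    // "catch-up" shortening -- that behavior risked congestion collapse.
    void scheduleNext(long nowMs) {
        if (nextReportTimeMs < 0) {
            long jitter = (initialDelayMs > 0)
                ? (long) (rand.nextDouble() * initialDelayMs) : 0;
            nextReportTimeMs = nowMs + jitter;
        } else {
            nextReportTimeMs = nowMs + intervalMs;
        }
    }

    // Whine in the log if the observed gap exceeded 2x the configured interval
    // (should only happen when the NN is under extreme load).
    boolean intervalTooLong(long lastReportMs, long nowMs) {
        return lastReportMs >= 0 && (nowMs - lastReportMs) > 2 * intervalMs;
    }

    long getNextReportTimeMs() { return nextReportTimeMs; }
}
```

Note that a late report simply pushes the next one out by the full interval rather than shortening it, which is the fix for the old behavior #3.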
I moved the BRLM register/unregister calls from registerDatanode into addDatanode / removeDatanode.

> The DataNodes should rate-limit their full block reports by asking the NN on heartbeat messages
> -----------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7923
>                 URL: https://issues.apache.org/jira/browse/HDFS-7923
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>         Attachments: HDFS-7923.000.patch, HDFS-7923.001.patch, HDFS-7923.002.patch, HDFS-7923.003.patch
>
>
> The DataNodes should rate-limit their full block reports. They can do this by first sending a heartbeat message to the NN with an optional boolean set which requests permission to send a full block report. If the NN responds with another optional boolean set, the DN will send an FBR... if not, it will wait until later. This can be done compatibly with optional fields.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)