[ https://issues.apache.org/jira/browse/HDFS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003113#comment-13003113 ]
dhruba borthakur commented on HDFS-1667: ---------------------------------------- Hi Matt, how about if we then do #2 first, then take it in stages? we have 3000 machines in our cluster, each machine has about 250K blocks. So I agree with you that anything we do to decrease the cpu needed for block report processing would be good. > Consider redesign of block report processing > -------------------------------------------- > > Key: HDFS-1667 > URL: https://issues.apache.org/jira/browse/HDFS-1667 > Project: Hadoop HDFS > Issue Type: Improvement > Components: name-node > Affects Versions: 0.22.0 > Reporter: Matt Foley > Assignee: Matt Foley > > The current implementation of block reporting has the datanode send the > entire list of blocks resident on that node in every report, hourly. > BlockManager.processReport in the namenode runs each block report through > reportDiff() first, to build four linked lists of replicas for different > dispositions, then processes each list. During that process, every block > belonging to the node (even the unchanged blocks) are removed and re-linked > in that node's blocklist. The entire process happens under the global > FSNamesystem write lock, so there is essentially zero concurrency. It takes > about 90 milliseconds to process a single 50,000-replica block report in the > namenode, during which no other read or write activities can occur. > There are several opportunities for improvement in this design. In order of > probable benefit, they include: > 1. Change the datanode to send a differential report. This saves the > namenode from having to do most of the work in reportDiff(), and avoids the > need to re-link all the unchanged blocks during the "diff" calculation. > 2. Keep the linked lists of "to do" work, but modify reportDiff() so that it > can be done under a read lock instead of the write lock. Then only the > processing of the lists needs to be under the write lock. Since the number > of blocks changed is usually much smaller than the number unchanged, this > should improve concurrency. > 3. Eliminate the linked lists and just immediately process each changed block > as it is read from the block report. The work on HDFS-1295 indicates that > this might save a large fraction of the total block processing time at scale, > due to the much smaller number of objects created and garbage collected > during processing of hundreds of millions of blocks. > 4. As a sidelight to #3, remove linked list use from BlockManager.addBlock(). > It currently uses linked lists as an argument to processReportedBlock() even > though addBlock() only processes a single replica on each call. This should > be replaced with a call to immediately process the block instead of enqueuing > it. -- This message is automatically generated by JIRA. - For more information on JIRA, see: http://www.atlassian.com/software/jira