[ 
https://issues.apache.org/jira/browse/HDFS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13003113#comment-13003113
 ] 

dhruba borthakur commented on HDFS-1667:
----------------------------------------

Hi Matt, how about if we then do #2 first, then take it in stages?

we have 3000 machines in our cluster, each machine has about 250K blocks. So I 
agree with you that anything we do to decrease the cpu needed for block report 
processing would be good.

> Consider redesign of block report processing
> --------------------------------------------
>
>                 Key: HDFS-1667
>                 URL: https://issues.apache.org/jira/browse/HDFS-1667
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: name-node
>    Affects Versions: 0.22.0
>            Reporter: Matt Foley
>            Assignee: Matt Foley
>
> The current implementation of block reporting has the datanode send the 
> entire list of blocks resident on that node in every report, hourly.  
> BlockManager.processReport in the namenode runs each block report through 
> reportDiff() first, to build four linked lists of replicas for different 
> dispositions, then processes each list.  During that process, every block 
> belonging to the node (even the unchanged blocks) are removed and re-linked 
> in that node's blocklist.  The entire process happens under the global 
> FSNamesystem write lock, so there is essentially zero concurrency.  It takes 
> about 90 milliseconds to process a single 50,000-replica block report in the 
> namenode, during which no other read or write activities can occur.
> There are several opportunities for improvement in this design.  In order of 
> probable benefit, they include:
> 1. Change the datanode to send a differential report.  This saves the 
> namenode from having to do most of the work in reportDiff(), and avoids the 
> need to re-link all the unchanged blocks during the "diff" calculation.
> 2. Keep the linked lists of "to do" work, but modify reportDiff() so that it 
> can be done under a read lock instead of the write lock.  Then only the 
> processing of the lists needs to be under the write lock.  Since the number 
> of blocks changed is usually much smaller than the number unchanged, this 
> should improve concurrency.
> 3. Eliminate the linked lists and just immediately process each changed block 
> as it is read from the block report.  The work on HDFS-1295 indicates that 
> this might save a large fraction of the total block processing time at scale, 
> due to the much smaller number of objects created and garbage collected 
> during processing of hundreds of millions of blocks.
> 4. As a sidelight to #3, remove linked list use from BlockManager.addBlock(). 
>  It currently uses linked lists as an argument to processReportedBlock() even 
> though addBlock() only processes a single replica on each call.  This should 
> be replaced with a call to immediately process the block instead of enqueuing 
> it.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to