[ https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345572#comment-14345572 ]
Todd Lipcon commented on HDFS-6658:
-----------------------------------

{quote}
1. the balancer
2. datanode manager start/stop of decommission
3. decommission manager scans
4. removal of failed storages
5. removal of dead nodes (i.e. all of a node's storages)

Balancing and decommissioning already have abysmal effects on performance. We cannot afford for either to be any worse.
{quote}

Lemme play devil's advocate for a minute (it's always easier to solve these problems in a JIRA comment than in the actual code, so I'll defer to your judgment here, given that I've been pretty inactive for the last year or so):

*Balancer*: the main operation it needs is "give me some blocks from storage X totalling at least Y bytes", right? Given that, we don't need to scan the entirety of the blocksmap. We can scan the blocksmap in hash order until we accumulate the right number of bytes. Even though we have to "skip over" blocks from the wrong storage, it's still a linear scan, and it scales with the total number of blocks. Plus, the balancer is typically removing blocks from many DNs at the same time (the overloaded ones), and that scan work can be combined pretty easily. I haven't thought about it quite enough, but my intuition is that, in a typical balancer run, a substantial fraction of the DNs are acting as "source" nodes, and thus the combined scan may actually end up being as efficient as several separate DN-specific scans.

*Decom*: while we don't want decommissioning to affect other NN operations, the actual latency of the decom work on the NN doesn't seem particularly important, right? As long as the work can be chunked up or done concurrently with other work, it's OK if the "start decom" process takes several seconds to scan the entire block map. Again, this work could be batched across all nodes that need decommissioning. Curious whether you ran any experiments on this, or just did back-of-the-envelope math?
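(For concreteness, the hash-order scan described above for the balancer might look roughly like the following toy sketch. `Block`, `pickBlocks`, and the storage-id field are illustrative stand-ins, not the real BlocksMap structures.)

```java
import java.util.*;

// Toy sketch of the balancer-side scan: walk the blocks map in its natural
// (hash) order, skip replicas on other storages, and stop as soon as at
// least targetBytes worth of blocks on the source storage are collected.
class BalancerScanSketch {
    record Block(long id, String storageId, long numBytes) {}

    static List<Block> pickBlocks(Iterable<Block> blocksMap,
                                  String sourceStorage, long targetBytes) {
        List<Block> picked = new ArrayList<>();
        long accumulated = 0;
        for (Block b : blocksMap) {            // linear in the total block count
            if (!b.storageId().equals(sourceStorage)) {
                continue;                      // "skip over" wrong-storage blocks
            }
            picked.add(b);
            accumulated += b.numBytes();
            if (accumulated >= targetBytes) {
                break;                         // enough bytes gathered, stop early
            }
        }
        return picked;
    }
}
```

The point of the sketch: no per-storage index is needed, since early termination keeps the common case well under a full blocksmap traversal.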
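(And the "combine the scan work" idea, i.e. serving every source storage, whether a balancer source or a decommissioning node, in a single linear pass, might look like this toy sketch; again, all names are illustrative, not actual HDFS classes.)

```java
import java.util.*;

// Toy sketch of a batched scan: one pass over the blocks map fills a
// per-source bucket for every interesting storage, instead of one full
// scan per storage.
class BatchedScanSketch {
    record Block(long id, String storageId, long numBytes) {}

    static Map<String, List<Block>> scanForAll(Iterable<Block> blocksMap,
                                               Set<String> sources) {
        Map<String, List<Block>> perSource = new HashMap<>();
        for (String s : sources) {
            perSource.put(s, new ArrayList<>());
        }
        for (Block b : blocksMap) {             // single linear pass
            List<Block> bucket = perSource.get(b.storageId());
            if (bucket != null) {
                bucket.add(b);                  // accumulate for each source at once
            }
        }
        return perSource;
    }
}
```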
FWIW, I definitely agree this work could be done separately - I didn't mean to derail your JIRA, just figured this would be an appropriate place to have the discussion.

> Namenode memory optimization - Block replicas list
> ---------------------------------------------------
>
>                 Key: HDFS-6658
>                 URL: https://issues.apache.org/jira/browse/HDFS-6658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.4.1
>            Reporter: Amir Langer
>            Assignee: Daryn Sharp
>         Attachments: BlockListOptimizationComparison.xlsx, BlocksMap redesign.pdf, HDFS-6658.patch, Namenode Memory Optimizations - Block replicas list.docx
>
>
> Part of the memory consumed by every BlockInfo object in the Namenode is a linked list of block references for every DatanodeStorageInfo (called "triplets").
> We propose to change the way we store the list in memory.
> Using primitive integer indexes instead of object references will reduce the memory needed for every block replica (when compressed oops is disabled), and in our new design the list overhead will be per DatanodeStorageInfo and not per block replica.
> See the attached design doc for details and evaluation results.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
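(For context on the proposal in the issue description: a per-DatanodeStorageInfo list of primitive int block indexes, replacing the per-replica linked-list "triplets", could be sketched as below. The names are illustrative only, not the actual HDFS-6658 patch.)

```java
import java.util.Arrays;

// Toy sketch of the proposed layout: each storage keeps a growable int[]
// of block indexes, so the per-replica cost is one 4-byte int rather than
// object references forming a per-block linked list.
class StorageBlockList {
    private int[] blockIndexes = new int[8];
    private int size = 0;

    void add(int blockIndex) {
        if (size == blockIndexes.length) {
            blockIndexes = Arrays.copyOf(blockIndexes, size * 2); // amortized O(1) growth
        }
        blockIndexes[size++] = blockIndex;
    }

    int size() { return size; }

    int get(int i) { return blockIndexes[i]; }
}
```

This is where the "overhead per DatanodeStorageInfo, not per block replica" claim comes from: the array's slack capacity is the only overhead, and it is paid once per storage.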