[ https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14345572#comment-14345572 ]

Todd Lipcon commented on HDFS-6658:
-----------------------------------

{quote}
1. the balancer
2. datanode manager start/stop of decommission
3. decommission manager scans
4. removal of failed storages
5. removal of dead nodes (i.e. all their storages)
Balancing and decommissioning already have abysmal effects on performance. We 
cannot afford for either to be any worse.
{quote}

Lemme play devil's advocate for a minute (it's always easier to solve these 
problems in a JIRA comment than in the actual code, so I'll defer to your 
judgment here, given I've been pretty inactive for the last year or so):

*Balancer*: the main operation it needs to do is "give me some blocks from 
storage X which total at least Y bytes", right? Given that, we don't need to 
scan the entirety of the blocksmap. We can scan the blocksmap in hash order 
until we accumulate the right number of bytes. Even though we have to "skip 
over" blocks from the wrong storage, it's still a single linear pass, bounded 
by the total number of blocks. Plus, the balancer is typically removing blocks 
from many DNs at the same time (the overloaded ones), and the scan work can 
actually be combined pretty easily. I haven't thought about it quite enough, 
but my intuition is that, in a typical balancer run, a substantial fraction of 
the DNs are acting as "source" nodes, so the combined scan may end up about as 
efficient as several separate DN-specific scans.
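
To make the combined-scan idea concrete, here's a minimal sketch of a single 
pass over a hash-ordered block map serving several source storages at once. 
The class and method names ({{Replica}}, {{pickBlocks}}, the byte-quota map) 
are made-up stand-ins, not the actual blocksmap internals:

{code:java}
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class CombinedBalancerScan {

  /** Minimal stand-in for a replica entry in the blocks map. */
  static final class Replica {
    final String storageId;
    final long numBytes;
    Replica(String storageId, long numBytes) {
      this.storageId = storageId;
      this.numBytes = numBytes;
    }
  }

  /**
   * Walk the map in its natural (hash) order, handing each replica to
   * whichever source storage still needs bytes. Stops as soon as every
   * storage has met its quota, so the cost is one partial linear pass
   * shared by all sources rather than one scan per source.
   */
  static Map<String, Long> pickBlocks(Iterator<Replica> blocksMap,
                                      Map<String, Long> bytesNeeded) {
    Map<String, Long> remaining = new HashMap<>(bytesNeeded);
    Map<String, Long> picked = new HashMap<>();
    while (blocksMap.hasNext() && !remaining.isEmpty()) {
      Replica r = blocksMap.next();
      Long need = remaining.get(r.storageId);
      if (need == null) {
        continue; // replica lives on a non-source storage: skip it
      }
      picked.merge(r.storageId, r.numBytes, Long::sum);
      if (need <= r.numBytes) {
        remaining.remove(r.storageId); // this source has enough bytes queued
      } else {
        remaining.put(r.storageId, need - r.numBytes);
      }
    }
    return picked;
  }
}
{code}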

*Decom*: while we don't want decom to affect other NN operations, it doesn't 
seem like the actual latency of the decom work on the NN is particularly 
important, right? So long as the work can be chunked up or done concurrently 
with other work, it's OK if the "start decom" process takes several seconds to 
scan the entire block map. Again, this work could be batched across all nodes 
that need decommissioning.
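
Something along these lines, where the scan holds the lock for a bounded chunk 
of blocks and then yields so queued operations can run. The lock, chunk size, 
and helper names are hypothetical stand-ins for the real namesystem internals:

{code:java}
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

class ChunkedDecomScan {
  private static final int CHUNK = 1000; // blocks per lock hold (illustrative)
  private final ReentrantLock namesystemLock = new ReentrantLock();

  void scan(Iterator<Long> blockIds, List<String> decommissioningNodes) {
    while (blockIds.hasNext()) {
      namesystemLock.lock();
      try {
        for (int i = 0; i < CHUNK && blockIds.hasNext(); i++) {
          long blockId = blockIds.next();
          // One pass covers every node being decommissioned, so the
          // scan cost is shared across the whole batch of nodes.
          checkReplication(blockId, decommissioningNodes);
        }
      } finally {
        namesystemLock.unlock(); // let other NN operations interleave
      }
    }
  }

  private void checkReplication(long blockId, List<String> nodes) {
    // placeholder: schedule re-replication if a replica of this block
    // lives on one of the decommissioning nodes
  }
}
{code}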

Curious if you did any experiments with this, or just back-of-the-envelope math?

FWIW I definitely agree this work could be done separately - didn't mean to 
derail your JIRA, just figured this would be an appropriate place to have the 
discussion.



> Namenode memory optimization - Block replicas list 
> ---------------------------------------------------
>
>                 Key: HDFS-6658
>                 URL: https://issues.apache.org/jira/browse/HDFS-6658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.4.1
>            Reporter: Amir Langer
>            Assignee: Daryn Sharp
>         Attachments: BlockListOptimizationComparison.xlsx, BlocksMap 
> redesign.pdf, HDFS-6658.patch, Namenode Memory Optimizations - Block replicas 
> list.docx
>
>
> Part of the memory consumed by every BlockInfo object in the Namenode is a 
> linked list of block references for every DatanodeStorageInfo (the so-called 
> "triplets"). 
> We propose to change the way we store this list in memory. 
> Using primitive integer indexes instead of object references will reduce the 
> memory needed for every block replica (when compressed oops is disabled), and 
> in our new design the list overhead will be per DatanodeStorageInfo, not per 
> block replica. 
> See the attached design doc for details and evaluation results.
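
For readers skimming the thread, the heart of the proposal amounts to 
something like the sketch below: each storage keeps a growable primitive int 
array of block indexes, so the per-replica cost is a single int rather than 
object pointers threaded through every BlockInfo. The class and method names 
are made up here; see the design doc for the actual layout:

{code:java}
import java.util.Arrays;

class StorageBlockList {
  private int[] blockIndexes = new int[16]; // indexes into a global block table
  private int size = 0;

  void add(int blockIndex) {
    if (size == blockIndexes.length) {
      // amortized O(1) growth; the array overhead is per storage,
      // not per block replica
      blockIndexes = Arrays.copyOf(blockIndexes, size * 2);
    }
    blockIndexes[size++] = blockIndex;
  }

  void removeAt(int pos) {
    // ordering isn't significant, so swap-with-last keeps removal O(1)
    blockIndexes[pos] = blockIndexes[--size];
  }

  int size() {
    return size;
  }
}
{code}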


