[ https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371612#comment-14371612 ]

Daryn Sharp commented on HDFS-6658:
-----------------------------------

Thanks for your comments!  The goal is significant memory reduction for large 
heaps and increased performance via much more GC-friendly data structures.

bq.  I didn't get a chance to look at the patch ..... My biggest concern is 
that I don't see how this data structure could be parallelized ...  this data 
structure seems to tie us more and more into the single-threaded world, since 
we now have more tightly coupled data structures to keep consistent.

Please take a look at the patch before drawing conclusions.  I'm by no means 
trying to box us into a single-threaded world.  Pretty much all of the data 
structures can be made concurrent without a lot of work, but that's not the 
(initial) goal.

bq. Another concern is complexity and (lack of) safety. ... a "safe mode" where 
all allocations are checked

The code has extensive safety checks to detect and prevent inconsistencies.  I 
detailed a few in the doc, but there are more in the code.  Essentially, safe 
mode is always on.


bq. I feel like this approach has all the complexity of off-heap (manual 
serialization, packing bits, possibility for reading the wrong data if there is 
a bug) 

The code isn't as complex as you may believe.  Triplets are gone, replaced by 
storages that maintain block ids, and the replicas map holds simple bit-packed 
longs containing a datanode id, storage index, and block index.
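
For illustration only, a packed replica entry along those lines could look 
like the following.  The field widths (32/8/24 bits) and names are my 
assumptions, not the patch's actual layout:

{code:java}
// Hypothetical sketch of a bit-packed replica entry: datanode id,
// storage index, and block index folded into one primitive long.
// Field widths (32/8/24 bits) are illustrative assumptions only.
final class PackedReplica {
  private static final int BLOCK_BITS = 24;  // block index within a storage
  private static final int STORAGE_BITS = 8; // storage index within a datanode

  static long pack(int datanodeId, int storageIndex, int blockIndex) {
    return ((datanodeId & 0xFFFFFFFFL) << (STORAGE_BITS + BLOCK_BITS))
        | ((long) (storageIndex & 0xFF) << BLOCK_BITS)
        | (blockIndex & 0xFFFFFFL);
  }

  static int datanodeId(long packed) {
    return (int) (packed >>> (STORAGE_BITS + BLOCK_BITS));
  }
  static int storageIndex(long packed) {
    return (int) ((packed >>> BLOCK_BITS) & 0xFF);
  }
  static int blockIndex(long packed) {
    return (int) (packed & 0xFFFFFFL);
  }
}
{code}

The point is that a replica reference becomes one long in a primitive array 
instead of several object references, which is what makes the structure 
GC-friendly.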

bq. without the positive aspects of off-heap (ability to grow and reduce memory 
consumption seamlessly via malloc, a "safe mode" where all allocations are 
checked

The data structures can all grow and shrink as necessary.  They may be sparse, 
but in practice they are densely packed.  The primitive arrays are 
2-dimensional, so only the outer array needs to be reallocated occasionally.  
A storage's block capacity is allocated based on the first block report, and 
the replicas map capacity is allocated based on the number of blocks in the 
blocksmap.
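
As a hedged sketch of that resizing behavior (chunk size and names are mine, 
not the patch's): the inner chunks are fixed-size primitive arrays that never 
move, so growth only copies the small outer array of chunk references.

{code:java}
// Illustrative 2-D primitive array: growing copies only the outer
// chunk-reference array; the long[] chunks themselves never move,
// which keeps reallocation cheap and GC-friendly.  Chunk size is an
// assumption for the sketch.
final class ChunkedLongArray {
  private static final int CHUNK_SHIFT = 12;        // 4096 longs per chunk
  private static final int CHUNK_SIZE = 1 << CHUNK_SHIFT;
  private long[][] chunks = new long[1][];

  long get(long index) {
    return chunks[(int) (index >>> CHUNK_SHIFT)]
                 [(int) (index & (CHUNK_SIZE - 1))];
  }

  void set(long index, long value) {
    int outer = (int) (index >>> CHUNK_SHIFT);
    if (outer >= chunks.length) {                   // grow outer array only
      chunks = java.util.Arrays.copyOf(
          chunks, Integer.highestOneBit(outer) << 1);
    }
    if (chunks[outer] == null) {
      chunks[outer] = new long[CHUNK_SIZE];         // allocate chunks lazily
    }
    chunks[outer][(int) (index & (CHUNK_SIZE - 1))] = value;
  }
}
{code}

The lazy chunk allocation is why the structures can be sparse in principle yet 
densely packed in practice.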

I agree we need further discussion after you review the patch, but I don't feel 
it should be part of HDFS-7836.  That's a more ambitious and complicated change 
than swapping out the block manager data structures.  We're trying to regain 
performance lost in 2.x and to scale our namespaces larger.  This is the 
culmination of work started last fall, and it requires minimal change to the BM 
in order to minimize risk.  We'll start testing at scale, and I'll continue to 
tune it, if we can work on getting this into branch-2.

I'm not saying this patch is 100% ready; I still have some cleanup to do.  The 
data structures, however, should all be sound, and that is the important part 
to review.

> Namenode memory optimization - Block replicas list 
> ---------------------------------------------------
>
>                 Key: HDFS-6658
>                 URL: https://issues.apache.org/jira/browse/HDFS-6658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.4.1
>            Reporter: Amir Langer
>            Assignee: Daryn Sharp
>         Attachments: BlockListOptimizationComparison.xlsx, BlocksMap 
> redesign.pdf, HDFS-6658.patch, HDFS-6658.patch, HDFS-6658.patch, Namenode 
> Memory Optimizations - Block replicas list.docx, New primative indexes.jpg, 
> Old triplets.jpg
>
>
> Part of the memory consumed by every BlockInfo object in the Namenode is a 
> linked list of block references for every DatanodeStorageInfo (called 
> "triplets"). 
> We propose to change the way we store the list in memory. 
> Using primitive integer indexes instead of object references will reduce the 
> memory needed for every block replica (when compressed oops is disabled) and 
> in our new design the list overhead will be per DatanodeStorageInfo and not 
> per block replica.
> see attached design doc. for details and evaluation results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
