[ 
https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368324#comment-14368324
 ] 

Colin Patrick McCabe commented on HDFS-6658:
--------------------------------------------

I didn't get a chance to look at the patch, but [~clamb] and I read the design 
doc in detail.

If I understand this correctly, the goals of this patch are better GC behavior 
and slightly smaller memory consumption for large heaps.  The patch 
accomplishes this by using the most memory-efficient structures Java has to 
offer: primitive arrays.  For smaller heaps (<32 GB), though, the patch is 
actually a small regression in heap memory size.
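
To make the idea concrete, here is a rough sketch of what a "primitive indexes" 
replica list could look like (the class and field names are mine, not the 
patch's): a per-storage list backed by a primitive array, so the per-replica 
cost is a primitive slot rather than an object reference plus a list node.

{code:java}
// Hypothetical sketch only; not the patch's actual classes.
class StorageBlockList {
  // A parallel primitive array replaces per-replica "triplet" references.
  // blockIds[i] identifies the i-th replica on this DatanodeStorageInfo;
  // the list overhead is paid once per storage, not once per replica.
  private long[] blockIds = new long[16];
  private int size = 0;

  void add(long blockId) {
    if (size == blockIds.length) {
      // grow geometrically, like ArrayList, but with no object header per entry
      blockIds = java.util.Arrays.copyOf(blockIds, size * 2);
    }
    blockIds[size++] = blockId;
  }

  long get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + size);
    }
    return blockIds[index];
  }
}
{code}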

My biggest concern is that I don't see how this data structure could be 
parallelized.  And parallelization is an important goal of the block manager 
scalability work in HDFS-7836.  In fact, this data structure seems to tie us 
more and more into the single-threaded world, since we now have more tightly 
coupled data structures to keep consistent.

Another concern is complexity and (lack of) safety.  I feel like this approach 
has all the complexity of off-heap (manual serialization, packing bits, the 
possibility of reading the wrong data if there is a bug) without the positive 
aspects of off-heap (the ability to grow and shrink memory consumption 
seamlessly via malloc, a "safe mode" where all allocations are checked, the 
possibility of serializing block data to disk later on).  The off-heap code we 
wrote does not use JNI, and it has a mode which checks all allocations and 
accesses.  I don't see anything similar here (although maybe it could be 
added?).  The complexity will remain, though.
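
For comparison, here is roughly what I mean by a checked "safe mode" for 
off-heap accesses.  This is only an illustration of the concept, not the actual 
off-heap code; the class name and the system property are made up.

{code:java}
import java.nio.ByteBuffer;

// Illustrative only: an off-heap slab with an optional "safe mode".
class OffHeapSlab {
  private static final boolean SAFE_MODE =
      Boolean.getBoolean("offheap.safeMode"); // e.g. -Doffheap.safeMode=true
  private final ByteBuffer buf;
  private final int slotSize;
  private final int slots;

  OffHeapSlab(int slotSize, int slots) {
    this.slotSize = slotSize;
    this.slots = slots;
    this.buf = ByteBuffer.allocateDirect(slotSize * slots); // off the Java heap
  }

  void putLong(int slot, int offset, long value) {
    buf.putLong(addr(slot, offset), value);
  }

  long getLong(int slot, int offset) {
    return buf.getLong(addr(slot, offset));
  }

  private int addr(int slot, int offset) {
    if (SAFE_MODE) {
      // "safe mode": catch accesses that spill into a neighboring slot,
      // which the raw buffer would happily allow
      if (slot < 0 || slot >= slots) {
        throw new IllegalArgumentException("bad slot " + slot);
      }
      if (offset < 0 || offset + Long.BYTES > slotSize) {
        throw new IllegalArgumentException("bad offset " + offset);
      }
    }
    return slot * slotSize + offset;
  }
}
{code}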

This is a bit of a tangent, but I still think that off-heap is a better 
solution than large arrays of primitives.  For one thing, if we off-heap the 
inodes map and the blocks map, we can potentially get the JVM heap below 32 GB 
and start getting the benefits of pointer compression again.  These benefits 
are not only memory usage but also speed, since the bytecode gets smaller.  
For another thing, we basically have to write our own malloc implementation if 
we want to go down the "on-heap arrays" road.  All of the malloc 
implementations I've seen were quite complex (jemalloc, tcmalloc), inflexible 
(various arena allocators that give out fixed-size chunks of memory, kind of 
like the one in this patch), or crude and inefficient (a homemade malloc with a 
few free lists of blocks of size 2^N).
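
To illustrate that last category, something like the following is about the 
simplest allocator you can write over one big on-heap array: segregated free 
lists keyed by power-of-two size class.  The class is hypothetical, but it 
shows how quickly the "on-heap arrays" road turns into maintaining our own 
malloc, including its failure modes (e.g. freeing with the wrong size class).

{code:java}
import java.util.ArrayDeque;

// Purely illustrative; not code from the patch or from the off-heap work.
class ArrayArena {
  private final long[] heap;   // the single backing array
  private int bump;            // next never-used offset
  // freeLists[k] holds offsets of freed chunks of 2^k longs
  @SuppressWarnings("unchecked")
  private final ArrayDeque<Integer>[] freeLists = new ArrayDeque[31];

  ArrayArena(int capacityLongs) {
    heap = new long[capacityLongs];
    for (int k = 0; k < freeLists.length; k++) {
      freeLists[k] = new ArrayDeque<>();
    }
  }

  /** Returns an offset into the backing array for a chunk of 2^sizeClass longs. */
  int alloc(int sizeClass) {
    Integer reused = freeLists[sizeClass].poll();
    if (reused != null) {
      return reused;
    }
    int chunk = 1 << sizeClass;
    if (bump + chunk > heap.length) {
      throw new OutOfMemoryError("arena exhausted");
    }
    int off = bump;
    bump += chunk;
    return off;
  }

  /** Caller must remember the size class; nothing stops a wrong free. */
  void free(int offset, int sizeClass) {
    freeLists[sizeClass].push(offset);
  }

  long[] backing() { return heap; }
}
{code}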

I think the basic approach might be good, but we need to talk more about how it 
fits into other improvements.  I also think this belongs in a branch, possibly 
HDFS-7836.

> Namenode memory optimization - Block replicas list 
> ---------------------------------------------------
>
>                 Key: HDFS-6658
>                 URL: https://issues.apache.org/jira/browse/HDFS-6658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.4.1
>            Reporter: Amir Langer
>            Assignee: Daryn Sharp
>         Attachments: BlockListOptimizationComparison.xlsx, BlocksMap 
> redesign.pdf, HDFS-6658.patch, HDFS-6658.patch, HDFS-6658.patch, Namenode 
> Memory Optimizations - Block replicas list.docx, New primative indexes.jpg, 
> Old triplets.jpg
>
>
> Part of the memory consumed by every BlockInfo object in the Namenode is a 
> linked list of block references for every DatanodeStorageInfo (called 
> "triplets"). 
> We propose to change the way we store the list in memory. 
> Using primitive integer indexes instead of object references will reduce the 
> memory needed for every block replica (when compressed oops is disabled) and 
> in our new design the list overhead will be per DatanodeStorageInfo and not 
> per block replica.
> See the attached design doc for details and evaluation results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
