[ https://issues.apache.org/jira/browse/HDFS-6658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14368324#comment-14368324 ]
Colin Patrick McCabe commented on HDFS-6658:
--------------------------------------------

I didn't get a chance to look at the patch, but [~clamb] and I read the design doc in detail. If I understand correctly, the goals of this patch are better GC behavior and slightly smaller memory consumption for large heaps. The patch accomplishes this by using the most memory-efficient structures Java has to offer: primitive arrays. For smaller heaps (<32 GB), though, the patch is actually a small regression in heap memory size.

My biggest concern is that I don't see how this data structure could be parallelized, and parallelization is an important goal of the block manager scalability work in HDFS-7836. In fact, this data structure seems to tie us more tightly into the single-threaded world, since we now have more tightly coupled data structures to keep consistent.

Another concern is complexity and (lack of) safety. This approach has all the complexity of off-heap storage (manual serialization, bit packing, the possibility of reading the wrong data if there is a bug) without its positive aspects (the ability to grow and shrink memory consumption seamlessly via malloc, a "safe mode" where all allocations are checked, and the possibility of serializing block data to disk later on). The off-heap code we wrote does not use JNI, and it has a mode that checks all allocations and accesses. I don't see anything similar here (although maybe it could be added?). The complexity would remain either way.

This is a bit of a tangent, but I still think that off-heap is a better solution than large arrays of primitives. For one thing, if we move the inodes map and the blocks map off-heap, we can potentially get the JVM heap below 32 GB and regain the benefits of pointer compression. Those benefits are not only memory usage but also speed, since the bytecode gets smaller.
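To make the trade-off concrete, here is a minimal sketch of the kind of primitive-index replica list the design doc describes: each per-storage list is threaded through an int array of "next" indexes instead of object references, so list overhead is one int per replica. Class and method names here are illustrative only, not taken from the patch.

```java
import java.util.Arrays;

// Hypothetical sketch: one storage's block list kept as primitive ints.
// next[blockIndex] holds the index of the next block on the same storage,
// replacing per-block object pointers ("triplets") with array indexes.
public class PrimitiveReplicaList {
    private final int[] next;   // -1 marks end of chain / unlinked slot
    private int head = -1;      // index of the first block on this storage

    public PrimitiveReplicaList(int maxBlocks) {
        next = new int[maxBlocks];
        Arrays.fill(next, -1);
    }

    // Link a block (by its index in the global blocks array) at the head.
    public void add(int blockIndex) {
        next[blockIndex] = head;
        head = blockIndex;
    }

    // Count replicas by walking the index chain.
    public int size() {
        int n = 0;
        for (int i = head; i != -1; i = next[i]) n++;
        return n;
    }

    public static void main(String[] args) {
        PrimitiveReplicaList storage = new PrimitiveReplicaList(8);
        storage.add(3);
        storage.add(5);
        storage.add(1);
        System.out.println(storage.size());  // prints 3
    }
}
```

Note how the sketch already shows the concern above: the `next` array is shared mutable state with no natural partitioning, so concurrent adds on the same storage would need external synchronization.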
For another thing, we would basically have to write our own malloc implementation if we go down the "on-heap arrays" road. All of the malloc implementations I've seen were either quite complex (jemalloc, tcmalloc), inflexible (various arena allocators that hand out fixed-size chunks of memory, like the one in this patch), or crude and inefficient (a homemade malloc with a few free lists of blocks of size 2^N).

I think the basic approach might be good, but we need to talk more about how it fits in with other improvements. I also think this belongs in a branch, possibly HDFS-7836.

> Namenode memory optimization - Block replicas list
> ---------------------------------------------------
>
>                 Key: HDFS-6658
>                 URL: https://issues.apache.org/jira/browse/HDFS-6658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.4.1
>            Reporter: Amir Langer
>            Assignee: Daryn Sharp
>         Attachments: BlockListOptimizationComparison.xlsx, BlocksMap redesign.pdf, HDFS-6658.patch, HDFS-6658.patch, HDFS-6658.patch, Namenode Memory Optimizations - Block replicas list.docx, New primative indexes.jpg, Old triplets.jpg
>
>
> Part of the memory consumed by every BlockInfo object in the Namenode is a linked list of block references for every DatanodeStorageInfo (called "triplets").
> We propose to change the way we store the list in memory.
> Using primitive integer indexes instead of object references will reduce the memory needed for every block replica (when compressed oops is disabled), and in our new design the list overhead will be per DatanodeStorageInfo rather than per block replica.
> See the attached design doc for details and evaluation results.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)