[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949847#comment-14949847 ]
Yi Liu commented on HDFS-9053: ------------------------------ Hi [~szetszwo], thanks for your comments. I will do as your suggestion: use an array list if the children size is small (<= 4K), otherwise use B-Tree. But there may be some differences for the implementation details. {quote} The field in INodeDirectory is List<INode> children which may refer to either an ArrayList or a BTreeList. We may replace the list at runtime {quote} It's hard to use a List<Node>, since when we use ArrayList, we do searching before index/delete and access through index, but BTree is in-order and we don't access through index. So I think we do it in following way: # actually I made an initial patch of switching between array list and b-tree few days ago. The logic of ArrayList is not complicated, so I implement it in a new class and support shrinkable, in the new class, I control the array and expanding, also the new class keeps reference to a b-tree, if the elements size becomes large, it switches to use the b-tree. The new class is an in-order data structure, not like ArrayList which we need to search before operating. In INodeDirectory, we just need to use the new class, another reason of I implement a new data structure class and don't do the switching in INodeDirectory is: we should have a {{SWITCH_THRESHOLD}} for switching from array list to b-tree, and need a low water mark to switch back, they should not be the same value, otherwise, the switching becomes frequent at some point, so I don't want to expose too many internal logic of switching in INodeDirectly. But the final memory usage of the new data structure is the same as ArrayList, even it has a reference to b-tree, and it supports Shrinkable.... I will give the memory usage after posting a new patch. {quote} I am also worry about the potiental bugs and the risk. If there is a bug in B-Tree, it is possible to lose one or more sub trees and, as a result, lose a lot of data. ArrayList is already well-tested. Anyway, we need more tests for the B-Tree, especially some long running random tests. {quote} Sure, I will add more tests for it, I have added many tests including some long running. I agree with that a bug-free data structure implementation is not easy, we should be careful and test the new implementations extensively :) > Support large directories efficiently using B-Tree > -------------------------------------------------- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode > Reporter: Yi Liu > Assignee: Yi Liu > Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)