[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934598#comment-14934598 ]
Yi Liu commented on HDFS-9053: ------------------------------ Thanks Jing for the further review. {quote} Yeah, sorry for the delay. {quote} NP, thanks for the help :) I also think it's a critical part of the code, thanks a lot for the review, [~szetszwo]! also welcome [~kihwal] and anyone who can review this. {quote} For insert/delete, the running time is O( n) for ArrayList. {quote} Yeah, if we count the re-allocations and copies of big arrays, I just mean the search time complexity before we do insert/delete. {quote} How about memory usage? One reason to use ArrayList instead of a more complicated data structure is to save memory. It seems to me that using B-Tree increases memory usage in general. It increases memory usage dramatically in some worst cases such as the average branching factor of the tree is small (this may also be a common case). {quote} That's a great question here. I have given few description in the my second comment above: [here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814], I would give a detailed explanation: Generally say, the B-Tree in the patch increases *very few* memory usage, and can be ignored: # *For normal and small size directories*, let's say it's (INode) children size is not large and then B-Tree only contains one node. Currently the default degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047, that means one B-Tree node can at most contain 2047 elements. And the children of the B-Tree root node is null, so the increased memory usage compared to ArrayList is: one object overhead + few variables, total increase about (12+4+4+4+4+4) = 32 bytes. # *For large directories*, let's say it's (INode) children size is larger than the max degree and then B-Tree contains more than one nodes. For any B-Tree node except root, it will contains at least min degree size elements, in our case, it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so it's the same as ArrayList at this point, typically the worst case for B-Tree from memory point view is all elements are in-order, in this case, when we split a B-Tree node, we allocate the elements size of memory and do expanding, but no more elements are added later in the split left B-Tree node, and there is a waste of 1/3 size of elements array memory, but actually it's almost the same as the worst cases even for ArrayList if there is no/few elements added, but in practice, the elements are not in order. Now back to the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: one object overhead + one element pointer in the father + 2 variables, so total increase about (12 + 4+4+4) = 24 bytes for every new added B-Tree node, but one B-Tree node contains at least (degree - 1) elements, and at most (2* degree - 1) elements. Another small increased memory is the object overhead of the children in the B-Tree inner node. In conclusion, the increased memory used for B-Tree compared to ArrayList is very small, for small/normal directories, it's only 32 bytes overhead which is even less than memory of one block in NN, for large directories, besides additional increased 12 bytes memory for the variables in B-Tree , it will increase about 24 bytes memory for every new added B-Tree node which can contain 1023 ~ 2047 INodes in the directory in our case. We can ignore these few memory overhead for a directory. And last, the benefits of B-Tree is great and obvious as described in the JIRA description, also in the patch, I consider the memory overhead carefully while implementing the B-Tree. Please let me know if you have further questions. Thanks a lot. > Support large directories efficiently using B-Tree > -------------------------------------------------- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode > Reporter: Yi Liu > Assignee: Yi Liu > Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)