[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934598#comment-14934598 ]
Yi Liu edited comment on HDFS-9053 at 10/1/15 1:30 AM: ------------------------------------------------------- Thanks Jing for the further review. {quote} Yeah, sorry for the delay. {quote} NP, thanks for the help :) I also think it's a critical part of the code, thanks a lot for the review, [~szetszwo]! also welcome [~kihwal] and anyone who can review this. {quote} For insert/delete, the running time is O( n) for ArrayList. {quote} Yeah, if we count the re-allocations and copies of big arrays, I just mean the search time complexity before we do insert/delete. {quote} How about memory usage? One reason to use ArrayList instead of a more complicated data structure is to save memory. It seems to me that using B-Tree increases memory usage in general. It increases memory usage dramatically in some worst cases such as the average branching factor of the tree is small (this may also be a common case). {quote} That's a great question here. I have given few description in the my second comment above: [here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814], I would give a detailed explanation: {quote} in some worst cases such as the average branching factor of the tree is small {quote} Don't worry about this, for any B-Tree node except root, it contains at least min degree size of elements, otherwise there is merging or element shift to guarantee this property. Besides, we expand the elements in B-Tree node the same as ArrayList. Generally say, the B-Tree in the patch increases *very few* memory usage, and can be ignored: # *For normal and small size directories*, let's say it's (INode) children size is not large and then B-Tree only contains one node. Currently the default degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047, that means one B-Tree node can at most contain 2047 elements. And the children of the B-Tree root node is null, so the increased memory usage compared to ArrayList is: one object overhead + few variables, total increase about (16 bytes object overhead + 4 int degree + 8 bytes reference to root node + 4 int size + 8 bytes null reference children + 4 bytes int childrenSize) = 44 bytes, and assume it's 64-bits system/JVM. # *For large directories*, let's say it's (INode) children size is larger than the max degree and then B-Tree contains more than one nodes. For any B-Tree node except root, it contains at least min degree size elements, in our case, it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so it's the same as ArrayList at this point, typically the worst case for B-Tree from memory point view is all elements are in-order, in this case, when we split a B-Tree node, we allocate the elements size of memory and do expanding, but no more elements are added later in the split left B-Tree node, and there is a waste of 1/3 size of elements array memory, but actually it's almost the same as the worst cases even for ArrayList if there is no/few elements added after expanding, but in practice, the elements are not in order. Now back to the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: one object overhead + one element pointer in the father + 2 variables, so total increase about (2* 16 two object overhead + 8 bytes reference to node in the father + 4 bytes int elementsSize + 8 bytes reference to children + 4 bytes in childrenSize) = 56 bytes for every new added B-Tree node, but one B-Tree node contains at least (degree - 1) elements, and at most (2* degree - 1) elements. Another small increased memory is the object overhead of the children in the B-Tree inner node for every (degree - 1) to (2* degree - 1) children B-Tree nodes. In conclusion, the increased memory used for B-Tree compared to ArrayList is very small, for small/normal directories, it's only 44 bytes overhead which is even less than memory of one block in NN, for large directories, besides additional increased (4 bytes int degree + 8 bytes reference to root node + 4 bytes int size) = 16 bytes memory for the variables in B-Tree , it will increase about 56 bytes memory for every new added B-Tree node which can contain 1023 ~ 2047 INodes in the directory in our case. We can ignore these few memory overhead for a directory. And last, the benefits of B-Tree is great and obvious as described in the JIRA description, also in the patch, I consider the memory overhead carefully while implementing the B-Tree. Please let me know if you have further questions. Thanks a lot. was (Author: hitliuyi): Thanks Jing for the further review. {quote} Yeah, sorry for the delay. {quote} NP, thanks for the help :) I also think it's a critical part of the code, thanks a lot for the review, [~szetszwo]! also welcome [~kihwal] and anyone who can review this. {quote} For insert/delete, the running time is O( n) for ArrayList. {quote} Yeah, if we count the re-allocations and copies of big arrays, I just mean the search time complexity before we do insert/delete. {quote} How about memory usage? One reason to use ArrayList instead of a more complicated data structure is to save memory. It seems to me that using B-Tree increases memory usage in general. It increases memory usage dramatically in some worst cases such as the average branching factor of the tree is small (this may also be a common case). {quote} That's a great question here. I have given few description in the my second comment above: [here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814], I would give a detailed explanation: {quote} in some worst cases such as the average branching factor of the tree is small {quote} Don't worry about this, for any B-Tree node except root, it contains at least min degree size of elements, otherwise there is merging or element shift to guarantee this property. Besides, we expand the elements in B-Tree node the same as ArrayList. Generally say, the B-Tree in the patch increases *very few* memory usage, and can be ignored: # *For normal and small size directories*, let's say it's (INode) children size is not large and then B-Tree only contains one node. Currently the default degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047, that means one B-Tree node can at most contain 2047 elements. And the children of the B-Tree root node is null, so the increased memory usage compared to ArrayList is: one object overhead + few variables, total increase about (12 bytes object overhead + 4 int degree + 8 bytes reference to root node + 4 int size + 8 bytes null reference children + 4 bytes int childrenSize) = 40 bytes, and assume it's 64-bits system/JVM. # *For large directories*, let's say it's (INode) children size is larger than the max degree and then B-Tree contains more than one nodes. For any B-Tree node except root, it contains at least min degree size elements, in our case, it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so it's the same as ArrayList at this point, typically the worst case for B-Tree from memory point view is all elements are in-order, in this case, when we split a B-Tree node, we allocate the elements size of memory and do expanding, but no more elements are added later in the split left B-Tree node, and there is a waste of 1/3 size of elements array memory, but actually it's almost the same as the worst cases even for ArrayList if there is no/few elements added after expanding, but in practice, the elements are not in order. Now back to the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: one object overhead + one element pointer in the father + 2 variables, so total increase about (2* 12 two object overhead + 8 bytes reference to node in the father + 4 bytes int elementsSize + 8 bytes reference to children + 4 bytes in childrenSize) = 48 bytes for every new added B-Tree node, but one B-Tree node contains at least (degree - 1) elements, and at most (2* degree - 1) elements. Another small increased memory is the object overhead of the children in the B-Tree inner node for every (degree - 1) to (2* degree - 1) children B-Tree nodes. In conclusion, the increased memory used for B-Tree compared to ArrayList is very small, for small/normal directories, it's only 40 bytes overhead which is even less than memory of one block in NN, for large directories, besides additional increased (4 bytes int degree + 8 bytes reference to root node + 4 bytes int size) = 16 bytes memory for the variables in B-Tree , it will increase about 48 bytes memory for every new added B-Tree node which can contain 1023 ~ 2047 INodes in the directory in our case. We can ignore these few memory overhead for a directory. And last, the benefits of B-Tree is great and obvious as described in the JIRA description, also in the patch, I consider the memory overhead carefully while implementing the B-Tree. Please let me know if you have further questions. Thanks a lot. > Support large directories efficiently using B-Tree > -------------------------------------------------- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode > Reporter: Yi Liu > Assignee: Yi Liu > Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)