[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

Yi Liu (JIRA) Wed, 30 Sep 2015 18:31:39 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934598#comment-14934598
 ]


Yi Liu edited comment on HDFS-9053 at 10/1/15 1:30 AM:
-------------------------------------------------------

Thanks Jing for the further review. 
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)

I also think it's a critical part of the code, thanks a lot for the review, 
[~szetszwo]! also welcome [~kihwal] and anyone who can review this.

{quote}
For insert/delete, the running time is O( n) for ArrayList.
{quote}
Yeah, if we count the re-allocations and copies of big arrays, I just mean the 
search time complexity before we do insert/delete.

{quote}
How about memory usage? One reason to use ArrayList instead of a more 
complicated data structure is to save memory. It seems to me that using B-Tree 
increases memory usage in general. It increases memory usage dramatically in 
some worst cases such as the average branching factor of the tree is small 
(this may also be a common case).
{quote}
That's a great question here.  I have given few description in the my second 
comment above: 
[here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814],
 I would give a detailed explanation:
{quote}
in some worst cases such as the average branching factor of the tree is small
{quote}
Don't worry about this, for any B-Tree node except root, it contains at least 
min degree size of elements, otherwise there is merging or element shift to 
guarantee this property. Besides, we expand the elements in B-Tree node the 
same as ArrayList.

Generally say, the B-Tree in the patch increases *very few* memory usage, and 
can be ignored:
# *For normal and small size directories*, let's say it's (INode) children size 
is not large and then B-Tree only contains one node. Currently the default 
degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047, 
that means one B-Tree node can at most contain 2047 elements. And the children 
of the B-Tree root node is null, so the increased memory usage compared to 
ArrayList is: one object overhead + few variables, total increase about (16 
bytes object overhead + 4 int degree + 8 bytes reference to root node + 4 int 
size + 8 bytes null reference children + 4 bytes int childrenSize) = 44 bytes, 
and assume it's 64-bits system/JVM.
# *For large directories*, let's say it's (INode) children size is larger than 
the max degree and then B-Tree contains more than one nodes. For any B-Tree 
node except root, it contains at least min degree size elements, in our case, 
it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so 
it's the same as ArrayList at this point, typically the worst case for B-Tree 
from memory point view is all elements are in-order, in this case, when we 
split a B-Tree node, we allocate the elements size of memory and do expanding, 
but no more elements are added later in the split left B-Tree node, and there 
is a waste of 1/3 size of elements array memory, but actually it's almost the 
same as the worst cases even for ArrayList if there is no/few elements added 
after expanding, but in practice, the elements are not in order.  Now back to 
the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: 
one object overhead + one element pointer in the father + 2 variables, so total 
increase about (2* 16 two object overhead + 8 bytes reference to node in the 
father +  4 bytes int elementsSize + 8 bytes reference to children + 4 bytes in 
childrenSize) = 56 bytes for every new added B-Tree node, but one B-Tree node 
contains at least (degree - 1) elements, and at most (2* degree - 1) elements.  
Another small increased memory is the object overhead of the children in the 
B-Tree inner node for every (degree - 1) to (2* degree - 1) children B-Tree 
nodes.

In conclusion, the increased memory used for B-Tree compared to ArrayList is 
very small, for small/normal directories, it's only 44 bytes overhead which is 
even less than memory of one block in NN, for large directories, besides 
additional increased (4 bytes int degree + 8 bytes reference to root node + 4 
bytes int size) = 16 bytes memory for the variables in B-Tree , it will 
increase about 56 bytes memory for every new added B-Tree node which can 
contain 1023 ~ 2047 INodes in the directory in our case.  We can ignore these 
few memory overhead for a directory. 

And last, the benefits of B-Tree is great and obvious as described in the JIRA 
description, also in the patch, I consider the memory overhead carefully while 
implementing the B-Tree. 

Please let me know if you have further questions. Thanks a lot.


was (Author: hitliuyi):
Thanks Jing for the further review. 
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)

I also think it's a critical part of the code, thanks a lot for the review, 
[~szetszwo]! also welcome [~kihwal] and anyone who can review this.

{quote}
For insert/delete, the running time is O( n) for ArrayList.
{quote}
Yeah, if we count the re-allocations and copies of big arrays, I just mean the 
search time complexity before we do insert/delete.

{quote}
How about memory usage? One reason to use ArrayList instead of a more 
complicated data structure is to save memory. It seems to me that using B-Tree 
increases memory usage in general. It increases memory usage dramatically in 
some worst cases such as the average branching factor of the tree is small 
(this may also be a common case).
{quote}
That's a great question here.  I have given few description in the my second 
comment above: 
[here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814],
 I would give a detailed explanation:
{quote}
in some worst cases such as the average branching factor of the tree is small
{quote}
Don't worry about this, for any B-Tree node except root, it contains at least 
min degree size of elements, otherwise there is merging or element shift to 
guarantee this property. Besides, we expand the elements in B-Tree node the 
same as ArrayList.

Generally say, the B-Tree in the patch increases *very few* memory usage, and 
can be ignored:
# *For normal and small size directories*, let's say it's (INode) children size 
is not large and then B-Tree only contains one node. Currently the default 
degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047, 
that means one B-Tree node can at most contain 2047 elements. And the children 
of the B-Tree root node is null, so the increased memory usage compared to 
ArrayList is: one object overhead + few variables, total increase about (12 
bytes object overhead + 4 int degree + 8 bytes reference to root node + 4 int 
size + 8 bytes null reference children + 4 bytes int childrenSize) = 40 bytes, 
and assume it's 64-bits system/JVM.
# *For large directories*, let's say it's (INode) children size is larger than 
the max degree and then B-Tree contains more than one nodes. For any B-Tree 
node except root, it contains at least min degree size elements, in our case, 
it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so 
it's the same as ArrayList at this point, typically the worst case for B-Tree 
from memory point view is all elements are in-order, in this case, when we 
split a B-Tree node, we allocate the elements size of memory and do expanding, 
but no more elements are added later in the split left B-Tree node, and there 
is a waste of 1/3 size of elements array memory, but actually it's almost the 
same as the worst cases even for ArrayList if there is no/few elements added 
after expanding, but in practice, the elements are not in order.  Now back to 
the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: 
one object overhead + one element pointer in the father + 2 variables, so total 
increase about (2* 12 two object overhead + 8 bytes reference to node in the 
father +  4 bytes int elementsSize + 8 bytes reference to children + 4 bytes in 
childrenSize) = 48 bytes for every new added B-Tree node, but one B-Tree node 
contains at least (degree - 1) elements, and at most (2* degree - 1) elements.  
Another small increased memory is the object overhead of the children in the 
B-Tree inner node for every (degree - 1) to (2* degree - 1) children B-Tree 
nodes.

In conclusion, the increased memory used for B-Tree compared to ArrayList is 
very small, for small/normal directories, it's only 40 bytes overhead which is 
even less than memory of one block in NN, for large directories, besides 
additional increased (4 bytes int degree + 8 bytes reference to root node + 4 
bytes int size) = 16 bytes memory for the variables in B-Tree , it will 
increase about 48 bytes memory for every new added B-Tree node which can 
contain 1023 ~ 2047 INodes in the directory in our case.  We can ignore these 
few memory overhead for a directory. 

And last, the benefits of B-Tree is great and obvious as described in the JIRA 
description, also in the patch, I consider the memory overhead carefully while 
implementing the B-Tree. 

Please let me know if you have further questions. Thanks a lot.

> Support large directories efficiently using B-Tree
> --------------------------------------------------
>
>                 Key: HDFS-9053
>                 URL: https://issues.apache.org/jira/browse/HDFS-9053
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>            Priority: Critical
>         Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

Reply via email to