[
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934598#comment-14934598
]
Yi Liu edited comment on HDFS-9053 at 10/1/15 1:30 AM:
-------------------------------------------------------
Thanks Jing for the further review.
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)
I also think it's a critical part of the code, thanks a lot for the review,
[~szetszwo]! also welcome [~kihwal] and anyone who can review this.
{quote}
For insert/delete, the running time is O( n) for ArrayList.
{quote}
Yeah, if we count the re-allocations and copies of big arrays, I just mean the
search time complexity before we do insert/delete.
{quote}
How about memory usage? One reason to use ArrayList instead of a more
complicated data structure is to save memory. It seems to me that using B-Tree
increases memory usage in general. It increases memory usage dramatically in
some worst cases such as the average branching factor of the tree is small
(this may also be a common case).
{quote}
That's a great question here. I have given few description in the my second
comment above:
[here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814],
I would give a detailed explanation:
{quote}
in some worst cases such as the average branching factor of the tree is small
{quote}
Don't worry about this, for any B-Tree node except root, it contains at least
min degree size of elements, otherwise there is merging or element shift to
guarantee this property. Besides, we expand the elements in B-Tree node the
same as ArrayList.
Generally say, the B-Tree in the patch increases *very few* memory usage, and
can be ignored:
# *For normal and small size directories*, let's say it's (INode) children size
is not large and then B-Tree only contains one node. Currently the default
degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047,
that means one B-Tree node can at most contain 2047 elements. And the children
of the B-Tree root node is null, so the increased memory usage compared to
ArrayList is: one object overhead + few variables, total increase about (16
bytes object overhead + 4 int degree + 8 bytes reference to root node + 4 int
size + 8 bytes null reference children + 4 bytes int childrenSize) = 44 bytes,
and assume it's 64-bits system/JVM.
# *For large directories*, let's say it's (INode) children size is larger than
the max degree and then B-Tree contains more than one nodes. For any B-Tree
node except root, it contains at least min degree size elements, in our case,
it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so
it's the same as ArrayList at this point, typically the worst case for B-Tree
from memory point view is all elements are in-order, in this case, when we
split a B-Tree node, we allocate the elements size of memory and do expanding,
but no more elements are added later in the split left B-Tree node, and there
is a waste of 1/3 size of elements array memory, but actually it's almost the
same as the worst cases even for ArrayList if there is no/few elements added
after expanding, but in practice, the elements are not in order. Now back to
the overhead, for B-Tree, for every B-Tree node, the increased memory usage is:
one object overhead + one element pointer in the father + 2 variables, so total
increase about (2* 16 two object overhead + 8 bytes reference to node in the
father + 4 bytes int elementsSize + 8 bytes reference to children + 4 bytes in
childrenSize) = 56 bytes for every new added B-Tree node, but one B-Tree node
contains at least (degree - 1) elements, and at most (2* degree - 1) elements.
Another small increased memory is the object overhead of the children in the
B-Tree inner node for every (degree - 1) to (2* degree - 1) children B-Tree
nodes.
In conclusion, the increased memory used for B-Tree compared to ArrayList is
very small, for small/normal directories, it's only 44 bytes overhead which is
even less than memory of one block in NN, for large directories, besides
additional increased (4 bytes int degree + 8 bytes reference to root node + 4
bytes int size) = 16 bytes memory for the variables in B-Tree , it will
increase about 56 bytes memory for every new added B-Tree node which can
contain 1023 ~ 2047 INodes in the directory in our case. We can ignore these
few memory overhead for a directory.
And last, the benefits of B-Tree is great and obvious as described in the JIRA
description, also in the patch, I consider the memory overhead carefully while
implementing the B-Tree.
Please let me know if you have further questions. Thanks a lot.
was (Author: hitliuyi):
Thanks Jing for the further review.
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)
I also think it's a critical part of the code, thanks a lot for the review,
[~szetszwo]! also welcome [~kihwal] and anyone who can review this.
{quote}
For insert/delete, the running time is O( n) for ArrayList.
{quote}
Yeah, if we count the re-allocations and copies of big arrays, I just mean the
search time complexity before we do insert/delete.
{quote}
How about memory usage? One reason to use ArrayList instead of a more
complicated data structure is to save memory. It seems to me that using B-Tree
increases memory usage in general. It increases memory usage dramatically in
some worst cases such as the average branching factor of the tree is small
(this may also be a common case).
{quote}
That's a great question here. I have given few description in the my second
comment above:
[here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814],
I would give a detailed explanation:
{quote}
in some worst cases such as the average branching factor of the tree is small
{quote}
Don't worry about this, for any B-Tree node except root, it contains at least
min degree size of elements, otherwise there is merging or element shift to
guarantee this property. Besides, we expand the elements in B-Tree node the
same as ArrayList.
Generally say, the B-Tree in the patch increases *very few* memory usage, and
can be ignored:
# *For normal and small size directories*, let's say it's (INode) children size
is not large and then B-Tree only contains one node. Currently the default
degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047,
that means one B-Tree node can at most contain 2047 elements. And the children
of the B-Tree root node is null, so the increased memory usage compared to
ArrayList is: one object overhead + few variables, total increase about (12
bytes object overhead + 4 int degree + 8 bytes reference to root node + 4 int
size + 8 bytes null reference children + 4 bytes int childrenSize) = 40 bytes,
and assume it's 64-bits system/JVM.
# *For large directories*, let's say it's (INode) children size is larger than
the max degree and then B-Tree contains more than one nodes. For any B-Tree
node except root, it contains at least min degree size elements, in our case,
it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so
it's the same as ArrayList at this point, typically the worst case for B-Tree
from memory point view is all elements are in-order, in this case, when we
split a B-Tree node, we allocate the elements size of memory and do expanding,
but no more elements are added later in the split left B-Tree node, and there
is a waste of 1/3 size of elements array memory, but actually it's almost the
same as the worst cases even for ArrayList if there is no/few elements added
after expanding, but in practice, the elements are not in order. Now back to
the overhead, for B-Tree, for every B-Tree node, the increased memory usage is:
one object overhead + one element pointer in the father + 2 variables, so total
increase about (2* 12 two object overhead + 8 bytes reference to node in the
father + 4 bytes int elementsSize + 8 bytes reference to children + 4 bytes in
childrenSize) = 48 bytes for every new added B-Tree node, but one B-Tree node
contains at least (degree - 1) elements, and at most (2* degree - 1) elements.
Another small increased memory is the object overhead of the children in the
B-Tree inner node for every (degree - 1) to (2* degree - 1) children B-Tree
nodes.
In conclusion, the increased memory used for B-Tree compared to ArrayList is
very small, for small/normal directories, it's only 40 bytes overhead which is
even less than memory of one block in NN, for large directories, besides
additional increased (4 bytes int degree + 8 bytes reference to root node + 4
bytes int size) = 16 bytes memory for the variables in B-Tree , it will
increase about 48 bytes memory for every new added B-Tree node which can
contain 1023 ~ 2047 INodes in the directory in our case. We can ignore these
few memory overhead for a directory.
And last, the benefits of B-Tree is great and obvious as described in the JIRA
description, also in the patch, I consider the memory overhead carefully while
implementing the B-Tree.
Please let me know if you have further questions. Thanks a lot.
> Support large directories efficiently using B-Tree
> --------------------------------------------------
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: namenode
> Reporter: Yi Liu
> Assignee: Yi Liu
> Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.
> Currently we use an ArrayList for the children under a directory, and the
> children are ordered in the list, for insert/delete, the time complexity is
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations
> and copies of arrays), for large directory, the operations are expensive. If
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM)
> continuous heap memory, it easily causes full GC in HDFS cluster where
> namenode heap memory is already highly used. I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to
> solve the problem suggested by [~shv].
> So the target of this JIRA is to implement a low memory footprint B-Tree and
> use it to replace ArrayList.
> If the elements size is not large (less than the maximum degree of B-Tree
> node), the B-Tree only has one root node which contains an array for the
> elements. And if the size grows large enough, it will split automatically,
> and if elements are removed, then B-Tree nodes can merge automatically (see
> more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3
> issues.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)