[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934598#comment-14934598
 ] 

Yi Liu commented on HDFS-9053:
------------------------------

Thanks Jing for the further review. 
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)

I also think it's a critical part of the code, thanks a lot for the review, 
[~szetszwo]! also welcome [~kihwal] and anyone who can review this.

{quote}
For insert/delete, the running time is O( n) for ArrayList.
{quote}
Yeah, if we count the re-allocations and copies of big arrays, I just mean the 
search time complexity before we do insert/delete.

{quote}
How about memory usage? One reason to use ArrayList instead of a more 
complicated data structure is to save memory. It seems to me that using B-Tree 
increases memory usage in general. It increases memory usage dramatically in 
some worst cases such as the average branching factor of the tree is small 
(this may also be a common case).
{quote}
That's a great question here.  I have given few description in the my second 
comment above: 
[here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814],
 I would give a detailed explanation:
Generally say, the B-Tree in the patch increases *very few* memory usage, and 
can be ignored:
# *For normal and small size directories*, let's say it's (INode) children size 
is not large and then B-Tree only contains one node. Currently the default 
degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047, 
that means one B-Tree node can at most contain 2047 elements. And the children 
of the B-Tree root node is null, so the increased memory usage compared to 
ArrayList is: one object overhead + few variables, total increase about 
(12+4+4+4+4+4) = 32 bytes.
# *For large directories*, let's say it's (INode) children size is larger than 
the max degree and then B-Tree contains more than one nodes. For any B-Tree 
node except root, it will contains at least min degree size elements, in our 
case, it's 1023, furthermore, we expand elements of B-Tree node same as 
ArrayList, so it's the same as ArrayList at this point, typically the worst 
case for B-Tree from memory point view is all elements are in-order, in this 
case, when we split a B-Tree node, we allocate the elements size of memory and 
do expanding, but no more elements are added later in the split left B-Tree 
node, and there is a waste of 1/3 size of elements array memory, but actually 
it's almost the same as the worst cases even for ArrayList if there is no/few 
elements added, but in practice, the elements are not in order.  Now back to 
the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: 
one object overhead + one element pointer in the father + 2 variables, so total 
increase about (12 + 4+4+4) = 24 bytes for every new added B-Tree node, but one 
B-Tree node contains at least (degree - 1) elements, and at most (2* degree - 
1) elements.  Another small increased memory is the object overhead of the 
children in the B-Tree inner node.

In conclusion, the increased memory used for B-Tree compared to ArrayList is 
very small, for small/normal directories, it's only 32 bytes overhead which is 
even less than memory of one block in NN, for large directories, besides 
additional increased 12 bytes memory for the variables in B-Tree , it will 
increase about 24 bytes memory for every new added B-Tree node which can 
contain 1023 ~ 2047 INodes in the directory in our case.  We can ignore these 
few memory overhead for a directory. 

And last, the benefits of B-Tree is great and obvious as described in the JIRA 
description, also in the patch, I consider the memory overhead carefully while 
implementing the B-Tree. 

Please let me know if you have further questions. Thanks a lot.

> Support large directories efficiently using B-Tree
> --------------------------------------------------
>
>                 Key: HDFS-9053
>                 URL: https://issues.apache.org/jira/browse/HDFS-9053
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>            Priority: Critical
>         Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete/search, the time 
> complexity is O(log n), but insertion/deleting causes re-allocations and 
> copies of big arrays, so the operations are costly.  For example, if the 
> children grow to 1M size, the ArrayList will resize to > 1M capacity, so need 
> > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS 
> cluster where namenode heap memory is already highly used.  I recap the 3 
> main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  It will solve the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to