[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968277#comment-14968277
 ] 

Yi Liu edited comment on HDFS-9053 at 10/22/15 12:51 AM:
-

Thanks [~jingzhao]!

{quote}
My concern is if the number of directories is limited, then maybe increasing 44 
bytes per directory is fine. E.g., for a big cluster with >100M 
files/directories, if we only have 1M directories, then we only increase the 
heap size by 44 MB. Compared with the total heap size this may be just <0.1% 
increase.
{quote}
Increasing 44 bytes per directory was the old implementation.
In the latest patch, each directory only increases *8* bytes, so the overhead 
per directory is quite small.  The new approach also uses only the B-Tree; it 
does NOT switch between ArrayList and B-Tree, please look at the {{007}} patch.

I know some users have more than 500M files, and some directories contain quite 
a lot of files. Usually this kind of large directory is accessed most 
frequently, so, as in the "barrel principle", it becomes the bottleneck of the NN.
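For context, the ArrayList pattern being replaced looks roughly like this (an illustrative sketch, not HDFS code; the class and method names are made up). Lookup is a binary search, but every insert or delete at an interior index shifts the tail of the backing array, which is O(n) per operation for a large directory:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Sketch only: sorted-children-in-an-ArrayList, as the NameNode does today. */
public class SortedChildrenDemo {
    static void insertSorted(List<String> children, String name) {
        int idx = Collections.binarySearch(children, name); // O(log n) search
        if (idx < 0) {
            // binarySearch returns -(insertionPoint) - 1 when the key is absent
            children.add(-idx - 1, name); // O(n): shifts every later element
        }
    }

    public static void main(String[] args) {
        List<String> children = new ArrayList<>();
        insertSorted(children, "c");
        insertSorted(children, "a");
        insertSorted(children, "b");
        System.out.println(children); // prints [a, b, c]
    }
}
```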


was (Author: hitliuyi):
Thanks [~jingzhao]!

{quote}
My concern is if the number of directories is limited, then maybe increasing 44 
bytes per directory is fine. E.g., for a big cluster with >100M 
files/directories, if we only have 1M directories, then we only increase the 
heap size by 44 MB. Compared with the total heap size this may be just <0.1% 
increase.
{quote}
Increasing 44 bytes per directory was the old implementation.
In the latest patch, each directory only increases *8* bytes, so the overhead 
per directory is quite small.

I know some users have more than 500M files, and some directories contain quite 
a lot of files. Usually this kind of large directory is accessed most 
frequently, so, as in the "barrel principle", it becomes the bottleneck of the NN.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, 
> HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, 
> HDFS-9053.007.patch
>
>
> This is a long-standing issue; we have tried to improve it in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list. For insert/delete, the time complexity is 
> O\(n) (the search is O(log n), but insertion/deletion causes re-allocations 
> and copies of arrays), so for a large directory the operations are expensive.  If 
> the children grow to 1M in size, the ArrayList will resize to > 1M capacity, so it 
> needs > 1M * 8 bytes = 8M (the reference size is 8 bytes on a 64-bit system/JVM) of 
> contiguous heap memory; this easily causes full GC in an HDFS cluster where 
> namenode heap memory is already highly used.  To recap, the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because of 
> re-allocations and copies of big arrays.
> # Dynamically allocating several MB of contiguous, long-lived heap memory 
> can easily cause full GC problems.
> # Even if most children are removed later, the directory INode still 
> occupies the same heap memory, since the ArrayList never shrinks.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but uses a B-Tree to 
> solve the problem, as suggested by [~shv]. 
> So the target of this JIRA is to implement a low-memory-footprint B-Tree and 
> use it to replace the ArrayList. 
> If the number of elements is not large (less than the maximum degree of a B-Tree 
> node), the B-Tree only has one root node, which contains an array for the 
> elements. If the size grows large enough, the tree splits automatically, 
> and if elements are removed, B-Tree nodes can merge automatically (see 
> more: https://en.wikipedia.org/wiki/B-tree).  This solves the above 3 
> issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-21 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968445#comment-14968445
 ] 

Yi Liu edited comment on HDFS-9053 at 10/22/15 3:34 AM:


Got it now. Thanks a lot [~jingzhao]!
Let's recalculate. For the original B-Tree implementation, my estimate was a 
44-byte increase, ignoring alignment.  Now that I have looked into the real 
memory layout, it actually increases (40 + 40) - 40 (ArrayList) = 40 bytes. 
And I can make a small improvement by removing the {{degree}} variable: 4 bytes 
+ a 4-byte alignment/padding gap = 8 bytes saved.
So finally, if we don't let BTree extend Node, it increases *32 bytes* per 
directory.

A 32-byte memory increment per directory is fine for me and also matches my 
original thought.  As in your example, if we have 1M directories, then we 
only increase the heap size by 32 MB.  I also respect Nicholas' comment; if we 
all think it's OK, I am happy to do this :).  


{noformat}
org.apache.hadoop.util.BTree object internals:
 OFFSET  SIZE  TYPE DESCRIPTIONVALUE
  016   (object header)N/A
 16 4   int BTree.degree   N/A
 20 4   int BTree.size N/A
 24 4   int BTree.modCount N/A
 28 4   (alignment/padding gap)N/A
 32 8  Node BTree.root N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
{noformat}

{noformat}
org.apache.hadoop.util.BTree.Node object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total
{noformat}
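The accounting above can be checked with simple arithmetic (the layout dumps look like output from a tool such as OpenJDK JOL's ClassLayout; the constants below are just copied from those dumps, so this is only an illustration of the sums, not a real measurement):

```java
/** Illustrative arithmetic only: recomputes the per-directory overhead. */
public class OverheadMath {
    static final int ARRAYLIST_BYTES = 40;      // java.util.ArrayList instance size
    static final int BTREE_BYTES = 40;          // BTree instance size (root held by reference)
    static final int ROOT_NODE_BYTES = 40;      // BTree.Node instance size
    static final int DEGREE_REMOVAL_SAVING = 8; // 4-byte degree field + 4-byte padding gap

    static int perDirectoryOverhead() {
        // (BTree + its root Node) replaces the ArrayList, then the degree
        // field and its alignment padding are removed.
        return (BTREE_BYTES + ROOT_NODE_BYTES) - ARRAYLIST_BYTES - DEGREE_REMOVAL_SAVING;
    }

    public static void main(String[] args) {
        System.out.println(perDirectoryOverhead() + " bytes per directory"); // prints 32
    }
}
```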


was (Author: hitliuyi):
Got it now. Thanks a lot [~jingzhao]!
Let's recalculate. For the original B-Tree implementation, my estimate was a 
44-byte increase, ignoring alignment.  Now that I have looked into the real 
memory layout, it actually increases (40 + 40) - 40 (ArrayList) = 40 bytes. 
And I can make a small improvement by removing the {{degree}} variable: 4 bytes 
+ a 4-byte alignment/padding gap = 8 bytes saved.
So finally, if we don't let BTree extend Node, it increases *32 bytes* per 
directory.

A 32-byte memory increment per directory is fine for me and was my original 
thought.  As in your example, if we have 1M directories, then we only increase 
the heap size by 32 MB.  I also respect Nicholas' comment; if we all think it's 
OK, I am happy to do this :).  


{noformat}
org.apache.hadoop.util.BTree object internals:
 OFFSET  SIZE  TYPE DESCRIPTIONVALUE
  016   (object header)N/A
 16 4   int BTree.degree   N/A
 20 4   int BTree.size N/A
 24 4   int BTree.modCount N/A
 28 4   (alignment/padding gap)N/A
 32 8  Node BTree.root N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
{noformat}

{noformat}
org.apache.hadoop.util.BTree.Node object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total
{noformat}


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-15 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959955#comment-14959955
 ] 

Yi Liu edited comment on HDFS-9053 at 10/16/15 12:56 AM:
-

Thanks [~szetszwo] for the comments.

{quote}
>> 24 8 Object[] Node.elements  N/A
>> 32 8 Object[] Node.children  N/A
It only counts the reference but array objects are not counted. So the BTree 
overhead is still a lot more than ArrayList.
{quote}

For a small number of elements (assume # < the max degree, which is 2047), 
{{children}} is a null reference, so there is no {{children}} array object here, 
just 8 bytes for the null reference. And {{ArrayList}} also has an {{elements}} 
array. So, as described above, {{BTree}} increases 8 bytes compared with 
{{ArrayList}} for a small number of elements.  Do I miss something?


was (Author: hitliuyi):
Thanks [~szetszwo] for the comments.

{quote}
>> 24 8 Object[] Node.elements  N/A
>> 32 8 Object[] Node.children  N/A
It only counts the reference but array objects are not counted. So the BTree 
overhead is still a lot more than ArrayList.
{quote}

For a small number of elements (assume # < the max degree, which is 2047), 
{{children}} is a null reference, so there is no {{children}} array object here, 
and {{ArrayList}} also has an {{elements}} array. So, as described above, 
{{BTree}} increases 8 bytes compared with {{ArrayList}} for a small number of 
elements.  Do I miss something?



[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-09 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949847#comment-14949847
 ] 

Yi Liu edited comment on HDFS-9053 at 10/9/15 8:49 AM:
---

Hi [~szetszwo], thanks for your comments.

I will do as you suggest: use an array list if the children size is small 
(<= 4K), otherwise use a B-Tree. But there may be some differences in the 
implementation details.

{quote}
The field in INodeDirectory is List children which may refer to either 
an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List, since with an ArrayList we search before 
inserting/deleting and access by index, while a B-Tree is in-order and we don't 
access it by index. So I think we can do it in the following way:
# I actually made an initial patch for switching between an array list and a 
b-tree a few days ago. The ArrayList logic is not complicated, so I implemented 
it in a new class that supports shrinking; in the new class, I control the array 
and its expansion, and the class also keeps a reference to a b-tree: if the 
element count becomes large, it switches to the b-tree. The new class is an 
in-order data structure, unlike ArrayList, where we need to search before 
operating.  In INodeDirectory, we just need to use the new class. Another reason 
I implemented a new data structure class instead of doing the switching in 
INodeDirectory is that we need a {{SWITCH_THRESHOLD}} for switching from the 
array list to the b-tree, plus a lower water mark to switch back. They should 
not be the same value, otherwise the switching could become frequent around that 
point, so I don't want to expose too much of the internal switching logic in 
INodeDirectory. I will give the memory usage numbers after posting a new patch.
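The high/low water mark idea above can be sketched as follows (illustrative only, not the actual patch: the class name, the TreeSet stand-in for the real B-Tree, and the threshold values are all made up for the demo). Switching back only below a strictly lower mark means a size oscillating around one boundary does not flip representations on every operation:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.TreeSet;

/** Sketch of switching with hysteresis between an array and a tree. */
public class ChildrenList {
    static final int SWITCH_THRESHOLD = 8; // high water mark (demo value)
    static final int LOW_WATER_MARK = 4;   // must be strictly below the threshold

    private Collection<String> children = new ArrayList<>();
    private boolean usingTree = false;

    void add(String name) {
        children.add(name);
        if (!usingTree && children.size() > SWITCH_THRESHOLD) {
            children = new TreeSet<>(children); // array -> tree
            usingTree = true;
        }
    }

    void remove(String name) {
        children.remove(name);
        if (usingTree && children.size() < LOW_WATER_MARK) {
            children = new ArrayList<>(children); // tree -> array (shrinks)
            usingTree = false;
        }
    }

    boolean usingTree() { return usingTree; }
    int size() { return children.size(); }
}
```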


{quote}
I am also worried about the potential bugs and the risk. If there is a bug in 
B-Tree, it is possible to lose one or more subtrees and, as a result, lose a 
lot of data. ArrayList is already well-tested. Anyway, we need more tests for 
the B-Tree, especially some long-running random tests.
{quote}
Sure, I will add more tests for it; I have already added many, including some 
long-running ones. I agree that a bug-free data structure implementation is not 
easy; we should be careful and test the new implementation extensively :) 


was (Author: hitliuyi):
Hi [~szetszwo], thanks for your comments.

I will do as you suggest: use an array list if the children size is small 
(<= 4K), otherwise use a B-Tree. But there may be some differences in the 
implementation details.

{quote}
The field in INodeDirectory is List children which may refer to either 
an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List, since with an ArrayList we search before 
inserting/deleting and access by index, while a B-Tree is in-order and we don't 
access it by index. So I think we can do it in the following way:
# I actually made an initial patch for switching between an array list and a 
b-tree a few days ago. The ArrayList logic is not complicated, so I implemented 
it in a new class that supports shrinking; in the new class, I control the array 
and its expansion, and the class also keeps a reference to a b-tree: if the 
element count becomes large, it switches to the b-tree. The new class is an 
in-order data structure, unlike ArrayList, where we need to search before 
operating.  In INodeDirectory, we just need to use the new class. Another reason 
I implemented a new data structure class instead of doing the switching in 
INodeDirectory is that we need a {{SWITCH_THRESHOLD}} for switching from the 
array list to the b-tree, plus a lower water mark to switch back. They should 
not be the same value, otherwise the switching could become frequent around that 
point, so I don't want to expose too much of the internal switching logic in 
INodeDirectory. 
But the final memory usage of the new data structure is the same as ArrayList's, 
even though it keeps a reference to the b-tree, and it supports shrinking. I 
will give the memory usage numbers after posting a new patch.


{quote}
I am also worried about the potential bugs and the risk. If there is a bug in 
B-Tree, it is possible to lose one or more subtrees and, as a result, lose a 
lot of data. ArrayList is already well-tested. Anyway, we need more tests for 
the B-Tree, especially some long-running random tests.
{quote}
Sure, I will add more tests for it; I have already added many, including some 
long-running ones. I agree that a bug-free data structure implementation is not 
easy; we should be careful and test the new implementation extensively :) 


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-09 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950498#comment-14950498
 ] 

Yi Liu edited comment on HDFS-9053 at 10/9/15 2:51 PM:
---

I found a good approach to reduce the B-Tree memory overhead so that it only 
increases memory usage by *8 bytes* compared with {{ArrayList}} for a small 
number of elements. 
So we don't need to use an ArrayList when #children is small (< 4K); we can 
always use the {{BTree}}.
The main idea is to let {{BTree}} extend the B-Tree Node; then we don't need a 
separate root node, since the {{BTree}} itself is the root.

{noformat}
java.util.ArrayList object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int AbstractList.modCount  N/A
 20 4  (alignment/padding gap)N/A
 24 4  int ArrayList.size N/A
 28 4  (alignment/padding gap)N/A
 32 8 Object[] ArrayList.elementData  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
{noformat}

{noformat}
org.apache.hadoop.util.btree.BTree object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
 40 4  int BTree.size N/A
 44 4  int BTree.modCount N/A
Instance size: 48 bytes (estimated, the sample instance is not available)
{noformat}
We can see that {{BTree}} only increases *8 bytes* compared with {{ArrayList}} 
for an {{INodeDirectory}}.
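The inheritance trick can be sketched as class shapes (field names taken from the layout dumps above; this is only an illustration of the idea, not the actual patch code):

```java
/**
 * Shape-only sketch: by making BTree itself the root node, the separate
 * root Node object (and the 8-byte reference to it) disappears, leaving
 * only BTree.size and BTree.modCount on top of the Node fields.
 */
public class BTreeSketch {
    static class Node {
        int elementsSize;
        int childrenSize;
        Object[] elements;
        Object[] children; // stays null while the tree is a single root node
    }

    /** The tree IS its own root node. */
    static class BTree extends Node {
        int size;     // total number of elements in the tree
        int modCount; // for fail-fast iteration
    }

    public static void main(String[] args) {
        BTree t = new BTree();
        // A small tree keeps everything in the root's elements array;
        // children stays null, costing only an 8-byte null reference.
        t.elements = new Object[]{"a", "b"};
        t.elementsSize = 2;
        t.size = 2;
        System.out.println(t.children == null); // prints true
    }
}
```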

[~jingzhao], [~szetszwo], please look at the new patch {{006}}.


was (Author: hitliuyi):
I found a good approach to reduce the B-Tree overhead so that it only increases 
memory usage by *8 bytes* compared with {{ArrayList}} for a small number of 
elements. 
So we don't need to use an ArrayList when #children is small (< 4K); we can 
always use the {{BTree}}.
The main idea is to let {{BTree}} extend the B-Tree Node; then we don't need a 
separate root node, since the {{BTree}} itself is the root.

{noformat}
java.util.ArrayList object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int AbstractList.modCount  N/A
 20 4  (alignment/padding gap)N/A
 24 4  int ArrayList.size N/A
 28 4  (alignment/padding gap)N/A
 32 8 Object[] ArrayList.elementData  N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
{noformat}

{noformat}
org.apache.hadoop.util.btree.BTree object internals:
 OFFSET  SIZE TYPE DESCRIPTIONVALUE
  016  (object header)N/A
 16 4  int Node.elementsSize  N/A
 20 4  int Node.childrenSize  N/A
 24 8 Object[] Node.elements  N/A
 32 8 Object[] Node.children  N/A
 40 4  int BTree.size N/A
 44 4  int BTree.modCount N/A
Instance size: 48 bytes (estimated, the sample instance is not available)
{noformat}
We can see that {{BTree}} only increases *8 bytes* compared with {{ArrayList}} 
for an {{INodeDirectory}}.

[~jingzhao], [~szetszwo], please look at the new patch {{006}}.


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-08 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949847#comment-14949847
 ] 

Yi Liu edited comment on HDFS-9053 at 10/9/15 3:51 AM:
---

Hi [~szetszwo], thanks for your comments.

I will do as you suggest: use an array list if the children size is small 
(<= 4K), otherwise use a B-Tree. But there may be some differences in the 
implementation details.

{quote}
The field in INodeDirectory is List children which may refer to either 
an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List, since with an ArrayList we search before 
inserting/deleting and access by index, while a B-Tree is in-order and we don't 
access it by index. So I think we can do it in the following way:
# I actually made an initial patch for switching between an array list and a 
b-tree a few days ago. The ArrayList logic is not complicated, so I implemented 
it in a new class that supports shrinking; in the new class, I control the array 
and its expansion, and the class also keeps a reference to a b-tree: if the 
element count becomes large, it switches to the b-tree. The new class is an 
in-order data structure, unlike ArrayList, where we need to search before 
operating.  In INodeDirectory, we just need to use the new class. Another reason 
I implemented a new data structure class instead of doing the switching in 
INodeDirectory is that we need a {{SWITCH_THRESHOLD}} for switching from the 
array list to the b-tree, plus a lower water mark to switch back. They should 
not be the same value, otherwise the switching could become frequent around that 
point, so I don't want to expose too much of the internal switching logic in 
INodeDirectory. 
But the final memory usage of the new data structure is the same as ArrayList's, 
even though it keeps a reference to the b-tree, and it supports shrinking. I 
will give the memory usage numbers after posting a new patch.


{quote}
I am also worried about the potential bugs and the risk. If there is a bug in 
B-Tree, it is possible to lose one or more subtrees and, as a result, lose a 
lot of data. ArrayList is already well-tested. Anyway, we need more tests for 
the B-Tree, especially some long-running random tests.
{quote}
Sure, I will add more tests for it; I have already added many, including some 
long-running ones. I agree that a bug-free data structure implementation is not 
easy; we should be careful and test the new implementation extensively :) 


was (Author: hitliuyi):
Hi [~szetszwo], thanks for your comments.

I will do as you suggest: use an array list if the children size is small 
(<= 4K), otherwise use a B-Tree. But there may be some differences in the 
implementation details.

{quote}
The field in INodeDirectory is List children which may refer to either 
an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List, since with an ArrayList we search before 
inserting/deleting and access by index, while a B-Tree is in-order and we don't 
access it by index. So I think we can do it in the following way:
# I actually made an initial patch for switching between an array list and a 
b-tree a few days ago. The ArrayList logic is not complicated, so I implemented 
it in a new class that supports shrinking; in the new class, I control the array 
and its expansion, and the class also keeps a reference to a b-tree: if the 
element count becomes large, it switches to the b-tree. The new class is an 
in-order data structure, unlike ArrayList, where we need to search before 
operating.  In INodeDirectory, we just need to use the new class. Another reason 
I implemented a new data structure class instead of doing the switching in 
INodeDirectory is that we need a {{SWITCH_THRESHOLD}} for switching from the 
array list to the b-tree, plus a lower water mark to switch back. They should 
not be the same value, otherwise the switching could become frequent around that 
point, so I don't want to expose too much of the internal switching logic in 
INodeDirectory. 
But the final memory usage of the new data structure is the same as ArrayList's, 
even though it keeps a reference to the b-tree, and it supports shrinking. I 
will give the memory usage numbers after posting a new patch.

{quote}
I am also worried about the potential bugs and the risk. If there is a bug in 
B-Tree, it is possible to lose one or more subtrees and, as a result, lose a 
lot of data. ArrayList is already well-tested. Anyway, we need more tests for 
the B-Tree, especially some long-running random tests.
{quote}
Sure, I will add more tests for it; I have already added many, including some 
long-running ones. I agree that a bug-free data structure implementation is not 
easy; we should be careful and test the new implementation extensively :) 

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942004#comment-14942004
 ] 

Yi Liu edited comment on HDFS-9053 at 10/3/15 12:49 AM:


Thanks [~szetszwo], good comment; I considered this carefully too. I want to 
convince you to let me use only the B-Tree here:
# Take the case you describe, where #children is small (< 4K). *1)* If #children 
is < 2K, the B-Tree only contains a root. As we counted before, the increased 
overhead is only 44 bytes, which is really very small for a directory; a 
contiguous block object is 80 bytes of memory (details below), so we only 
increase by about half of one block object per directory in the NN. *2)* If 
#children is > 2K and < 4K (here we use 4K as the example), the B-Tree contains 
at most 1 root node and 3 leaf nodes. One leaf node increases about (40 bytes + 
16 bytes elements array overhead) = 56 bytes, 1 root node is (40 bytes + 16 
bytes elements array overhead + 16 bytes children array overhead + 3 children * 
8) = 96 bytes, the b-tree itself is 40 bytes, and we need to subtract the 
ArrayList's (40 bytes + 16 bytes elements array overhead) = 56 bytes; so we 
increase at most 56 * 3 + 96 + 40 - 56 = 248 bytes of overhead. But an ArrayList 
of 4K references to INodes needs more than 4K * 8 = 32K of memory, so the memory 
increase is only about *0.75%*
{noformat}
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguous object 
internals:
 OFFSET  SIZE  TYPE DESCRIPTIONVALUE
  016   (object header)N/A
 16 8  long Block.blockId  N/A
 24 8  long Block.numBytes N/A
 32 8  long Block.generationStamp  N/A
 40 8  long BlockInfo.bcId N/A
 48 2 short BlockInfo.replication  N/A
 50 6   (alignment/padding gap)N/A
 56 8 LinkedElement BlockInfo.nextLinkedElementN/A
 64 8  Object[] BlockInfo.triplets N/A
 72 8 BlockUnderConstructionFeature BlockInfo.uc   N/A
Instance size: 80 bytes (estimated, the sample instance is not available)
{noformat}
# One advantage of the B-Tree compared with ArrayList for a small number of 
children is that the B-Tree can shrink. If a directory's children decrease from 
4K to fewer than 2K, about 2K * 8 = 16K of memory is wasted when using an 
ArrayList. 
# On the other hand, if we switch between ArrayList and B-Tree, we may need to 
write a class wrapping the two data structures, and it still needs a 16-byte 
object header + an 8-byte additional reference = 24 bytes. 

What do you think? Thanks, Nicholas.
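As a sanity check on the arithmetic above (illustrative only; every constant is a figure quoted in the comment, not a new measurement):

```java
/** Recomputes the worst-case overhead for a directory with ~4K children. */
public class LargeDirMath {
    static int extraBytes() {
        int leaf = 40 + 16;              // Node object + elements array header = 56
        int root = 40 + 16 + 16 + 3 * 8; // Node + elements + children arrays + 3 child refs = 96
        int btree = 40;                  // the BTree object itself
        int arrayList = 40 + 16;         // the ArrayList being replaced = 56
        return 3 * leaf + root + btree - arrayList;
    }

    public static void main(String[] args) {
        long referenceArrayBytes = 4096L * 8; // 4K child references at 8 bytes each = 32K
        System.out.printf("%d extra bytes, %.2f%% of the 32K reference array%n",
            extraBytes(), 100.0 * extraBytes() / referenceArrayBytes);
    }
}
```

This yields 248 extra bytes, and 248 / 32768 is about 0.76%, in line with the roughly 0.75% figure above.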


was (Author: hitliuyi):
Thanks [~szetszwo], good comment; I considered this carefully too. I want to 
convince you to let me use only the B-Tree here:
# Take the case you describe, where #children is small (< 4K). *1)* If #children 
is < 2K, the B-Tree only contains a root. As we counted before, the increased 
overhead is only 44 bytes, which is really very small for a directory; a 
contiguous block object is 80 bytes of memory (details below), so we only 
increase by about half of one block object per directory in the NN. *2)* If 
#children is > 2K and < 4K (here we use 4K as the example), the B-Tree contains 
at most 1 root node and 3 leaf nodes. One leaf node increases about (40 bytes + 
16 bytes elements array overhead) = 56 bytes, 1 root node is (40 bytes + 16 
bytes elements array overhead + 16 bytes children array overhead + 3 children * 
8) = 96 bytes, the b-tree itself is 40 bytes, and we need to subtract the 
ArrayList's (40 bytes + 16 bytes elements array overhead) = 56 bytes; so we 
increase at most 56 * 3 + 96 + 40 - 56 = 248 bytes of overhead. But an ArrayList 
of 4K references to INodes needs more than 4K * 8 = 32K of memory, so the memory 
increase is only about *0.75%*
{noformat}
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguous object 
internals:
 OFFSET  SIZE  TYPE DESCRIPTIONVALUE
  016   (object header)N/A
 16 8  long Block.blockId  N/A
 24 8  long Block.numBytes N/A
 32 8  long Block.generationStamp  N/A
 40 8  long BlockInfo.bcId N/A
 48 2 short BlockInfo.replication  N/A
 50 6   (alignment/padding gap)N/A
 56 8 LinkedElement BlockInfo.nextLinkedElementN/A
 64 8  

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-10-02 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942004#comment-14942004
 ] 

Yi Liu edited comment on HDFS-9053 at 10/3/15 12:47 AM:


Thanks [~szetszwo], good comment; I considered this carefully too. I want to 
convince you to let me use only the B-Tree here:
# Take the case you describe, where #children is small (< 4K). *1)* If #children 
is < 2K, the B-Tree only contains a root. As we counted before, the increased 
overhead is only 44 bytes, which is really very small for a directory; a 
contiguous block object is 80 bytes of memory (details below), so we only 
increase by about half of one block object per directory in the NN. *2)* If 
#children is > 2K and < 4K (here we use 4K as the example), the B-Tree contains 
at most 1 root node and 3 leaf nodes. One leaf node increases about (40 bytes + 
16 bytes elements array overhead) = 56 bytes, 1 root node is (40 bytes + 16 
bytes elements array overhead + 16 bytes children array overhead + 3 children * 
8) = 96 bytes, the b-tree itself is 40 bytes, and we need to subtract the 
ArrayList's (40 bytes + 16 bytes elements array overhead) = 56 bytes; so we 
increase at most 56 * 3 + 96 + 40 - 56 = 248 bytes of overhead. But an ArrayList 
of 4K references to INodes needs more than 4K * 8 = 32K of memory, so the memory 
increase is only about *0.75%*
{noformat}
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguous object 
internals:
 OFFSET  SIZE  TYPE DESCRIPTIONVALUE
  016   (object header)N/A
 16 8  long Block.blockId  N/A
 24 8  long Block.numBytes N/A
 32 8  long Block.generationStamp  N/A
 40 8  long BlockInfo.bcId N/A
 48 2 short BlockInfo.replication  N/A
 50 6   (alignment/padding gap)N/A
 56 8 LinkedElement BlockInfo.nextLinkedElementN/A
 64 8  Object[] BlockInfo.triplets N/A
 72 8 BlockUnderConstructionFeature BlockInfo.uc   N/A
Instance size: 80 bytes (estimated, the sample instance is not available)
{noformat}
# One advantage of the B-Tree over ArrayList, even for small children counts, is 
that the B-Tree can shrink. If a directory's children decrease from 4K to fewer than 
2K, about 2K * 8 = 16K of memory is wasted when using an ArrayList, which never shrinks. 
# On the other hand, if we switch between ArrayList and B-Tree, we may 
need to write a class wrapping the two data structures, and it still costs 
16 bytes object overhead + 8 bytes additional reference = 24 bytes. 

What do you think? Thanks, Nicholas.
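To double-check the arithmetic above, here is a small sketch; all per-object sizes are the comment's 64-bit JVM estimates, not measured values, and the class name is hypothetical:

```java
// Re-deriving the 248-byte / ~0.75% figures; sizes follow the comment's
// 64-bit estimates (node object = 40, array header = 16, reference = 8).
public class LargeDirOverheadEstimate {
    static final int LEAF_NODE = 40 + 16;              // node object + elements array header = 56
    static final int ROOT_NODE = 40 + 16 + 16 + 3 * 8; // + children array header + 3 child refs = 96
    static final int BTREE_OBJ = 40;                   // the BTree object itself
    static final int ARRAY_LIST = 40 + 16;             // the replaced ArrayList = 56

    /** Extra bytes for a directory of 4K children: 3 leaves + root + tree - ArrayList. */
    static int extraBytes() {
        return 3 * LEAF_NODE + ROOT_NODE + BTREE_OBJ - ARRAY_LIST; // 248
    }

    /** Memory held by the 4K INode references alone, 8 bytes each. */
    static int referenceBytes() {
        return 4 * 1024 * 8; // 32768
    }

    public static void main(String[] args) {
        System.out.println(extraBytes()); // prints 248
        System.out.println(100.0 * extraBytes() / referenceBytes() + "%");
    }
}
```

248 / 32768 is about 0.76%, matching the rough *0.75%* figure in the comment.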



[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-30 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936183#comment-14936183
 ] 

Yi Liu edited comment on HDFS-9053 at 10/1/15 1:33 AM:
---

Thanks [~szetszwo] for the comments.

{quote}
There are quite a few errors in description and the analysis. Let's hold on the 
patch for the moment and fix all these errors first.
{quote}
Sure, let's fix the errors in the description and the analysis first, thanks 
for pointing out this.

{quote}
> So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 
> = 40 bytes on 64-bits system/JVM. (I have not counted object alignment)
It seems that there is at least one more reference not yet counted since Node 
is non-static.
{quote}
I mean the *increased memory* compared with using ArrayList, as I stated 
in the comments above.
The reference to Node is already counted; taking a 64-bit system/JVM as the example, 
the first '8' is the reference to the B-Tree root, and the second '8' is the null 
reference for children in BTree#Node.  This doesn't count the memory parts 
which ArrayList also has: 16 bytes object overhead + 4 bytes int modCount + 8 
bytes reference to elements + 4 bytes int elementsSize + the elements array memory.



Nicholas, I am going to fix the two places in the JIRA description and the analysis 
in the JIRA comments which we have found so far:
# fix the time complexity of insert/delete for ArrayList
# fix the description of memory usage for B-Tree to add the system/JVM 
information and more details.

Have you found any other errors besides these? Please tell me, and I would appreciate 
it if you could also help to edit them directly.


was (Author: hitliuyi):
Thanks [~szetszwo] for the comments.

{quote}
here are quite a few errors in description and the analysis. Let's hold on the 
patch for the moment and fix all these errors first.
{quote}
Sure, let's fix the errors in the description and the analysis first, thanks 
for pointing out this.

{quote}
> So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 
> = 40 bytes on 64-bits system/JVM. (I have not counted object alignment)
It seems that there is at least one more reference not yet counted since Node 
is non-static.
{quote}
I mean the *increased memory* compared with using ArrayList, and I have stated 
this in above comments.
The reference to Node is already counted, use 64-bits system/JVM as example, 
the first '8' is the reference to the B-Tree root.  The second '8' is the null 
reference for children in BTree#Node.  This doesn't count the memory parts 
which ArrayList also has: 12 bytes object overhead + 4 bytes int modCount + 8 
bytes reference to elements + 4 bytes int elementsSize + elements array memory.



Nicholas, I am going to fix the two places in the JIRA description and analysis 
in JIRA comments, which we find so far
# fix the time complexity of insert/delete for ArrayList
# fix the description about memory usage for B-Tree to add the system/JVM 
information and add more detailed information.

Any other errors do you find besides these? please tell me and appreciate if 
you can also help to edit them directly.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-30 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934634#comment-14934634
 ] 

Yi Liu edited comment on HDFS-9053 at 10/1/15 1:32 AM:
---

Thanks for the comments, Nicholas.

{quote}
Where are the numbers, especially the 4s, from? Do we assume a 32-bit world?
{quote}

{code}
public class BTree<E> implements Iterable<E> {
  ...
  private final int degree;
  private Node root;
  private int size;
  private transient int modCount = 0;
 ...
}

private final class Node {
static final int DEFAULT_CAPACITY = 5;
private Object[] elements;
private int elementsSize; 
private Object[] children;
private int childrenSize;
...
}
{code}
Sorry, I should have used a 64-bit system/JVM in the description; the details are:

Compared to ArrayList, we add the following:
private final int degree;   <-   4 bytes int
private Node root;   <-  reference, 4 bytes on a 32-bit 
system/JVM, 8 bytes on a 64-bit system/JVM
private int size;<-  4 bytes int

{{Node}} object overhead   <--  12 bytes on 32-bit, 16 bytes on 64-bit
private Object[] children; <-  null reference, 4 bytes on a 
32-bit system/JVM, 8 bytes on a 64-bit system/JVM
private int childrenSize;   <-  4 bytes int.

So in total 12+4+4+4+4+4 = 32 bytes on a 32-bit system/JVM, and 16+4+8+4+8+4 = 
44 bytes on a 64-bit system/JVM. (I have not counted object alignment.)
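The 44-byte tally above can be sketched as follows; the field sizes are the comment's assumptions for a plain 64-bit JVM (no compressed oops, no alignment), and the class name is hypothetical:

```java
// Tallying the extra fields the B-Tree carries beyond what ArrayList
// also has, per the comment's 64-bit JVM assumptions.
public class SingleRootOverhead {
    static final int INT = 4;         // int field
    static final int REF = 8;         // reference on a 64-bit JVM
    static final int OBJ_HEADER = 16; // object header on a 64-bit JVM

    /** BTree's own fields plus the single root Node's extra parts. */
    static int bytes64() {
        int btreeFields = INT + REF + INT;     // degree, root reference, size
        int rootNode = OBJ_HEADER + REF + INT; // Node header, null children ref, childrenSize
        return btreeFields + rootNode;         // (4+8+4) + (16+8+4) = 44
    }

    public static void main(String[] args) {
        System.out.println(bytes64()); // prints 44
    }
}
```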



was (Author: hitliuyi):
Thanks for the comments, Nicholas.

{quote}
Where are the numbers, especially the 4s, from? Do we assume a 32-bit world?
{quote}

{code}
public class BTree implements Iterable {
  ...
  private final int degree;
  private Node root;
  private int size;
  private transient int modCount = 0;
 ...
}

private final class Node {
static final int DEFAULT_CAPACITY = 5;
private Object[] elements;
private int elementsSize; 
private Object[] children;
private int childrenSize;
...
}
{code}
Sorry, I should use 64-bits system/JVM to describe, and details are:

Compared to ArrayList, we increases following things:
private final int degree;   <-   4 bytes Integer
private Node root;   <-  reference, 4 bytes on 32-bits 
system/JVM, 8 bytes on 64-bits system/JVM
private int size;<-  4 bytes Integer

{{Node}} object overhead   <--  12 bytes
private Object[] children; <-  null reference, 4 bytes on 
32-bits system/JVM, 8 bytes on 64-bits system/JVM
private int childrenSize;   <-  4 bytes Integer.

So totally  12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 = 
40 bytes on 64-bits system/JVM. (I have not counted object alignment)


> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-30 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934598#comment-14934598
 ] 

Yi Liu edited comment on HDFS-9053 at 10/1/15 1:30 AM:
---

Thanks Jing for the further review. 
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)

I also think it's a critical part of the code, thanks a lot for the review, 
[~szetszwo]! also welcome [~kihwal] and anyone who can review this.

{quote}
For insert/delete, the running time is O(n) for ArrayList.
{quote}
Yeah, if we count the re-allocations and copies of big arrays. I just meant the 
search time complexity before we do the insert/delete.
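A minimal sketch of this point (not HDFS code, names are hypothetical): in a sorted ArrayList the *search* is O(log n) via binary search, but the insert itself shifts up to n references, so the whole operation is O(n).

```java
import java.util.ArrayList;
import java.util.Collections;

public class SortedChildren {
    /** Insert keeping sorted order: O(log n) locate + O(n) shift. */
    static void insertSorted(ArrayList<String> list, String name) {
        int idx = Collections.binarySearch(list, name); // O(log n) search
        if (idx < 0) idx = -idx - 1;                    // convert insertion point
        list.add(idx, name); // System.arraycopy shifts the tail: O(n)
    }
}
```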

{quote}
How about memory usage? One reason to use ArrayList instead of a more 
complicated data structure is to save memory. It seems to me that using B-Tree 
increases memory usage in general. It increases memory usage dramatically in 
some worst cases such as the average branching factor of the tree is small 
(this may also be a common case).
{quote}
That's a great question.  I gave a brief description in my second 
comment above 
([here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814]);
 here is a detailed explanation:
{quote}
in some worst cases such as the average branching factor of the tree is small
{quote}
Don't worry about this: any B-Tree node except the root contains at least 
(degree - 1) elements; otherwise merging or element shifting restores this 
property. Besides, we grow the elements array of a B-Tree node the same way 
ArrayList does.

Generally speaking, the B-Tree in the patch increases memory usage *very little*, 
and the increase can be ignored:
# *For normal and small size directories*, whose (INode) children count 
is not large, the B-Tree contains only one node. Currently the default 
degree of the B-Tree used in INodeDirectory is 1024, so the max node size is 2047, 
which means one B-Tree node can contain at most 2047 elements. The children 
array of the B-Tree root node is null, so the increased memory usage compared to 
ArrayList is one object overhead + a few variables, in total about (16 
bytes object overhead + 4 bytes int degree + 8 bytes reference to root node + 4 bytes 
int size + 8 bytes null children reference + 4 bytes int childrenSize) = 44 bytes, 
assuming a 64-bit system/JVM.
# *For large directories*, whose (INode) children count is larger than 
the max node size, the B-Tree contains more than one node. Any B-Tree 
node except the root contains at least (degree - 1) elements, 1023 in our case; 
furthermore, we grow the elements array of a B-Tree node the same way ArrayList 
does, so it matches ArrayList at this point. Typically the worst case for the 
B-Tree from a memory point of view is when all elements arrive in order: when we 
split a B-Tree node, we allocate and expand the elements array, but no more 
elements are added later to the split left node, so about 1/3 of its elements 
array memory is wasted; this is almost the same as the worst case for ArrayList 
when no/few elements are added after expanding, and in practice the elements do 
not arrive in order.  Now back to the overhead: for every B-Tree node, the 
increased memory usage is one object overhead + one element pointer in the parent 
+ 2 variables, in total about (2 * 16 bytes object overhead + 8 bytes reference to 
the node in the parent + 4 bytes int elementsSize + 8 bytes reference to children 
+ 4 bytes int childrenSize) = 56 bytes for every newly added B-Tree node, and one 
B-Tree node contains at least (degree - 1) and at most (2 * degree - 1) elements.  
Another small increase is the object overhead of the children array in a B-Tree 
inner node, amortized over its (degree - 1) to (2 * degree - 1) child nodes.

In conclusion, the memory increase of the B-Tree compared to ArrayList is very 
small. For small/normal directories it is only 44 bytes of overhead, which is 
even less than the memory of one block in the NN. For large directories, besides 
an additional (4 bytes int degree + 8 bytes reference to root node + 4 bytes int 
size) = 16 bytes for the variables in the B-Tree, it increases about 56 bytes of 
memory for every newly added B-Tree node, and each node holds 1023 ~ 2047 INodes 
of the directory in our case.  We can ignore this small memory overhead for a 
directory. 
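Amortizing the figures above in a quick sketch (the comment's estimates: ~56 bytes per additional node, degree = 1024 as in the patch; the class name is hypothetical), the per-child overhead for a large directory is a small fraction of a byte:

```java
// Each extra B-Tree node costs ~56 bytes and holds between degree-1 = 1023
// and 2*degree-1 = 2047 children, so the worst-case per-child overhead is
// 56 / 1023 bytes.
public class PerNodeAmortization {
    static final int NODE_OVERHEAD_BYTES = 56;
    static final int DEGREE = 1024;

    /** Worst case: every node holds only the minimum (degree - 1) children. */
    static double worstBytesPerChild() {
        return (double) NODE_OVERHEAD_BYTES / (DEGREE - 1); // ~0.055
    }

    public static void main(String[] args) {
        System.out.printf("%.3f bytes per INode%n", worstBytesPerChild());
    }
}
```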

And last, the benefits of the B-Tree are great and obvious, as described in the 
JIRA description; in the patch I considered the memory overhead carefully while 
implementing the B-Tree. 

Please let me know if you have further questions. Thanks a lot.



[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934689#comment-14934689
 ] 

Yi Liu edited comment on HDFS-9053 at 9/29/15 6:18 AM:
---

[~jingzhao], you have given a great review, thanks a lot; it hits the two 
places where I had considered carefully how to do better.

{quote}
In INodeDirectory#replaceChild, can we directly call addOrReplace instead of 
calling get first?
{quote}
I think we can use {{addOrReplace}} directly, since there should be an existing 
INode with the name.  To keep the original behavior, I will remove the added one 
if {{addOrReplace}} returns null.

{quote}
Do you think we can avoid the following code? Maybe we can add the EK type to 
the ReadOnlyCollection/ReadOnlyList level?
{quote}
That's a good comment; I considered this carefully.  I also thought about adding 
the EK type as a generic type at the ReadOnlyCollection/ReadOnlyList level, but I 
felt it looked unnatural for a collection/list, not all implementations of 
ReadOnlyList need to implement iterating from a specified element, and I thought 
it was OK since it's a private interface we use in HDFS.  I will leave this 
comment for the next version of the patch; if you feel we'd better do this, I 
will update it, I am OK with both ways.

{quote}
DirectoryWithSnapshotFeature#getChildrenList#iterator(EK) forgot to increase 
pos? Maybe also add a new test for this (e.g., set a small ls limit and list a 
snapshot of a directory)?
{quote}
Great catch, let me update it, and add a new test in {{TestLargeDirectory}} to 
cover.

{quote}
In getListing, instead of continuing the iteration, can we just call size() to 
calculate the number of the remaining items?
{quote}
I tried to find a better way. {{size()}} returns the total number of elements in 
the B-Tree, but we don't know the current index, so it seems we cannot calculate 
the number of the remaining items.
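As an illustration of the point above (not HDFS code, names are hypothetical): a sorted structure lets you resume iteration from a key, but computing "remaining = size() - index" needs the key's rank, which a plain B-Tree, like {{TreeSet}} below, does not track; counting the tail is itself a linear walk rather than O(1).

```java
import java.util.TreeSet;

public class RemainingItems {
    /** Count children strictly after the given name; a linear walk, not O(1). */
    static int remainingAfter(TreeSet<String> children, String lastReturned) {
        return children.tailSet(lastReturned, false).size();
    }

    public static void main(String[] args) {
        TreeSet<String> children = new TreeSet<>();
        for (int i = 0; i < 10; i++) children.add("f" + i); // f0 .. f9
        System.out.println(remainingAfter(children, "f4")); // prints 5 (f5..f9)
    }
}
```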




was (Author: hitliuyi):
[~jingzhao], you have given a good review, and it hits the two places I ever 
considered carefully how to do it better. Thanks.

{quote}
In INodeDirectory#replaceChild, can we directly call addOrReplace instead of 
calling get first?
{quote}
I think we can use {{addOrReplace}} directly.  Since there should be an INode 
with the name existing.  To keep original behavior, I will remove the added one 
if {{addOrReplace}} returns null.

{quote}
Do you think we can avoid the following code? Maybe we can add the EK type to 
the ReadOnlyCollection/ReadOnlyList level?
{quote}
That's a good comment, I ever considered this carefully.  I also thought adding 
the EK type as one of generic type to the ReadOnlyCollection/ReadOnlyList 
level, but I felt it looked not natural for a collection/list, and not all 
implementations of ReadOnlyList need to implement iterating starting from 
specified element, also I though it was OK since it's a private interface we 
use in HDFS.  I will leave this comment in next version of patch, if you feel 
we'd better to do this, I will update it, I am OK with the both ways.

{quote}
DirectoryWithSnapshotFeature#getChildrenList#iterator(EK) forgot to increase 
pos? Maybe also add a new test for this (e.g., set a small ls limit and list a 
snapshot of a directory)?
{quote}
Great catch, let me update it, and add a new test in {{TestLargeDirectory}} to 
cover.

{quote}
In getListing, instead of continuing the iteration, can we just call size() to 
calculate the number of the remaining items?
{quote}
I ever tried to find a better way. {{size()}} will return the total number of 
elements in B-Tree, but we don't know the current index, so seems not able to 
calculate the number of the remaining items.



> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete/search, the time 
> complexity is O(log n), but insertion/deleting causes re-allocations and 
> copies of big arrays, so the operations are costly.  For example, if the 
> children grow to 1M size, the ArrayList will resize to > 1M capacity, so need 
> > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS 
> cluster where namenode heap memory is already highly used.  I recap the 3 
> main issues:
> # Insertion/deletion 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934689#comment-14934689
 ] 

Yi Liu edited comment on HDFS-9053 at 9/29/15 6:55 AM:
---

[~jingzhao], you have given a great review, thanks a lot; it hits the two 
places where I had considered carefully how to do better.

{quote}
In INodeDirectory#replaceChild, can we directly call addOrReplace instead of 
calling get first?
{quote}
Good point, I think we can use {{addOrReplace}} directly.  Since there should 
be an INode with the name existing.  To keep original behavior, I will remove 
the added one if {{addOrReplace}} returns null.

{quote}
Do you think we can avoid the following code? Maybe we can add the EK type to 
the ReadOnlyCollection/ReadOnlyList level?
{quote}
That's a good comment; I considered this carefully.  I also thought about adding 
the EK type as a generic type at the ReadOnlyCollection/ReadOnlyList level, but I 
felt it looked unnatural for a collection/list, not all implementations of 
ReadOnlyList need to implement iterating from a specified element, and I thought 
it was OK since it's a private interface we use in HDFS. 
I will update it according to your comment; if you feel it's not good later, I can 
change it back.

{quote}
DirectoryWithSnapshotFeature#getChildrenList#iterator(EK) forgot to increase 
pos? Maybe also add a new test for this (e.g., set a small ls limit and list a 
snapshot of a directory)?
{quote}
Great catch, let me update it, and add a new test in {{TestLargeDirectory}} to 
cover.

{quote}
In getListing, instead of continuing the iteration, can we just call size() to 
calculate the number of the remaining items?
{quote}
I tried to find a better way. {{size()}} returns the total number of elements in 
the B-Tree, but we don't know the current index, so it seems we cannot calculate 
the number of the remaining items.




was (Author: hitliuyi):
[~jingzhao], you have given a great review, thanks a lot, and it hits the two 
places I ever considered carefully how to do it better.

{quote}
In INodeDirectory#replaceChild, can we directly call addOrReplace instead of 
calling get first?
{quote}
Good point, I think we can use {{addOrReplace}} directly.  Since there should 
be an INode with the name existing.  To keep original behavior, I will remove 
the added one if {{addOrReplace}} returns null.

{quote}
Do you think we can avoid the following code? Maybe we can add the EK type to 
the ReadOnlyCollection/ReadOnlyList level?
{quote}
That's a good comment, I ever considered this carefully.  I also thought adding 
the EK type as one of generic type to the ReadOnlyCollection/ReadOnlyList 
level, but I felt it looked not natural for a collection/list, and not all 
implementations of ReadOnlyList need to implement iterating starting from 
specified element, also I though it was OK since it's a private interface we 
use in HDFS.  I will leave this comment in next version of patch, if you feel 
we'd better to do this, I will update it, I am OK with the both ways.

{quote}
DirectoryWithSnapshotFeature#getChildrenList#iterator(EK) forgot to increase 
pos? Maybe also add a new test for this (e.g., set a small ls limit and list a 
snapshot of a directory)?
{quote}
Great catch, let me update it, and add a new test in {{TestLargeDirectory}} to 
cover.

{quote}
In getListing, instead of continuing the iteration, can we just call size() to 
calculate the number of the remaining items?
{quote}
I ever tried to find a better way. {{size()}} will return the total number of 
elements in B-Tree, but we don't know the current index, so seems not able to 
calculate the number of the remaining items.



> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete/search, the time 
> complexity is O(log n), but insertion/deleting causes re-allocations and 
> copies of big arrays, so the operations are costly.  For example, if the 
> children grow to 1M size, the ArrayList will resize to > 1M capacity, so need 
> > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS 
> cluster where namenode heap memory is already highly used.  I recap the 3 
> main issues:
> # Insertion/deletion operations 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934689#comment-14934689
 ] 

Yi Liu edited comment on HDFS-9053 at 9/29/15 6:20 AM:
---

[~jingzhao], you have given a great review, thanks a lot; it hits the two 
places where I had considered carefully how to do better.

{quote}
In INodeDirectory#replaceChild, can we directly call addOrReplace instead of 
calling get first?
{quote}
Good point, I think we can use {{addOrReplace}} directly.  Since there should 
be an INode with the name existing.  To keep original behavior, I will remove 
the added one if {{addOrReplace}} returns null.

{quote}
Do you think we can avoid the following code? Maybe we can add the EK type to 
the ReadOnlyCollection/ReadOnlyList level?
{quote}
That's a good comment; I considered this carefully.  I also thought about adding 
the EK type as a generic type at the ReadOnlyCollection/ReadOnlyList level, but I 
felt it looked unnatural for a collection/list, not all implementations of 
ReadOnlyList need to implement iterating from a specified element, and I thought 
it was OK since it's a private interface we use in HDFS.  I will leave this 
comment for the next version of the patch; if you feel we'd better do this, I 
will update it, I am OK with both ways.

{quote}
DirectoryWithSnapshotFeature#getChildrenList#iterator(EK) forgot to increase 
pos? Maybe also add a new test for this (e.g., set a small ls limit and list a 
snapshot of a directory)?
{quote}
Great catch, let me update it, and add a new test in {{TestLargeDirectory}} to 
cover.

{quote}
In getListing, instead of continuing the iteration, can we just call size() to 
calculate the number of the remaining items?
{quote}
I tried to find a better way. {{size()}} returns the total number of elements in 
the B-Tree, but we don't know the current index, so it seems we cannot calculate 
the number of the remaining items.




was (Author: hitliuyi):
[~jingzhao], you have given a great review, thanks a lot, and it hits the two 
places I ever considered carefully how to do it better.

{quote}
In INodeDirectory#replaceChild, can we directly call addOrReplace instead of 
calling get first?
{quote}
I think we can use {{addOrReplace}} directly.  Since there should be an INode 
with the name existing.  To keep original behavior, I will remove the added one 
if {{addOrReplace}} returns null.

{quote}
Do you think we can avoid the following code? Maybe we can add the EK type to 
the ReadOnlyCollection/ReadOnlyList level?
{quote}
That's a good comment, I ever considered this carefully.  I also thought adding 
the EK type as one of generic type to the ReadOnlyCollection/ReadOnlyList 
level, but I felt it looked not natural for a collection/list, and not all 
implementations of ReadOnlyList need to implement iterating starting from 
specified element, also I though it was OK since it's a private interface we 
use in HDFS.  I will leave this comment in next version of patch, if you feel 
we'd better to do this, I will update it, I am OK with the both ways.

{quote}
DirectoryWithSnapshotFeature#getChildrenList#iterator(EK) forgot to increase 
pos? Maybe also add a new test for this (e.g., set a small ls limit and list a 
snapshot of a directory)?
{quote}
Great catch, let me update it, and add a new test in {{TestLargeDirectory}} to 
cover.

{quote}
In getListing, instead of continuing the iteration, can we just call size() to 
calculate the number of the remaining items?
{quote}
I ever tried to find a better way. {{size()}} will return the total number of 
elements in B-Tree, but we don't know the current index, so seems not able to 
calculate the number of the remaining items.



> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete/search, the time 
> complexity is O(log n), but insertion/deleting causes re-allocations and 
> copies of big arrays, so the operations are costly.  For example, if the 
> children grow to 1M size, the ArrayList will resize to > 1M capacity, so need 
> > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS 
> cluster where namenode heap memory is already highly used.  I recap the 3 
> main issues:
> # 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934598#comment-14934598
 ] 

Yi Liu edited comment on HDFS-9053 at 9/30/15 1:29 AM:
---

Thanks Jing for the further review. 
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)

I also think it's a critical part of the code, thanks a lot for the review, 
[~szetszwo]! also welcome [~kihwal] and anyone who can review this.

{quote}
For insert/delete, the running time is O( n) for ArrayList.
{quote}
Yeah, that's if we count the re-allocations and copies of big arrays; I just 
meant the search time complexity before we do the insert/delete.

{quote}
How about memory usage? One reason to use ArrayList instead of a more 
complicated data structure is to save memory. It seems to me that using B-Tree 
increases memory usage in general. It increases memory usage dramatically in 
some worst cases such as the average branching factor of the tree is small 
(this may also be a common case).
{quote}
That's a great question.  I gave a brief description in my second comment 
above: 
[here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814],
 and here is a more detailed explanation:
{quote}
in some worst cases such as the average branching factor of the tree is small
{quote}
Don't worry about this: any B-Tree node except the root contains at least 
(degree - 1) elements, otherwise merging or element shifting restores this 
property. Besides, we grow a B-Tree node's elements array the same way 
ArrayList does.

Generally speaking, the B-Tree in the patch increases memory usage *very 
little*, and the increase can be ignored:
# *For normal and small directories*, let's say its (INode) children size is 
not large, so the B-Tree contains only one node. Currently the default degree 
of the B-Tree used in INodeDirectory is 1024, so the max degree is 2047, which 
means one B-Tree node can contain at most 2047 elements. The children array of 
the B-Tree root node is null, so the increased memory usage compared to 
ArrayList is one object overhead plus a few variables, in total about (12 
bytes object overhead + 4 bytes int degree + 8 bytes reference to the root 
node + 4 bytes int size + 8 bytes null children reference + 4 bytes int 
childrenSize) = 40 bytes, assuming a 64-bit system/JVM.
# *For large directories*, let's say its (INode) children size is larger than 
the max degree, so the B-Tree contains more than one node. Any B-Tree node 
except the root contains at least (degree - 1) elements, in our case 1023. 
Furthermore, we grow a B-Tree node's elements array the same way ArrayList 
does, so it is the same as ArrayList at this point. The typical worst case for 
the B-Tree from a memory point of view is that all elements are inserted in 
order: when we split a B-Tree node, we allocate the elements array and expand 
it, but no more elements are added later to the left node of the split, so 
about 1/3 of the elements array memory is wasted. But that is almost the same 
as the worst case for ArrayList if no or few elements are added after 
expanding, and in practice the elements are not inserted in order.  Now back 
to the overhead: for every B-Tree node, the increased memory usage is one 
object overhead + one element pointer in the parent + 2 variables, in total 
about (2 * 12 bytes two object overheads + 8 bytes reference to the node in 
the parent + 4 bytes int elementsSize + 8 bytes reference to children + 4 
bytes int childrenSize) = 48 bytes for every newly added B-Tree node, and one 
B-Tree node contains at least (degree - 1) elements and at most (2 * degree - 
1) elements.  Another small memory increase is the object overhead of the 
children array in a B-Tree inner node, one for every (degree - 1) to (2 * 
degree - 1) child B-Tree nodes.

In conclusion, the memory increase of the B-Tree compared to ArrayList is very 
small. For small/normal directories, it is only a 40-byte overhead, which is 
even less than the memory of one block in the NN. For large directories, 
besides an additional (4 bytes int degree + 8 bytes reference to the root node 
+ 4 bytes int size) = 16 bytes for the variables in the B-Tree, it increases 
about 48 bytes for every newly added B-Tree node, which can contain 1023 ~ 
2047 INodes of the directory in our case.  We can ignore this small memory 
overhead for a directory. 
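As a sanity check of the arithmetic above, here is a tiny sketch. The constants are the assumed 64-bit, no-compressed-oops figures used in this analysis (12-byte object header, 8-byte reference), not measured values:

```java
public class BTreeOverhead {
    static final int HEADER = 12; // assumed object header, 64-bit JVM
    static final int REF = 8;     // reference size without compressed oops
    static final int INT = 4;

    // Extra memory of a one-node B-Tree vs ArrayList: Node object header,
    // int degree, root reference, int size, plus the root's null children
    // reference and int childrenSize.
    static int smallDirOverhead() {
        return HEADER + INT + REF + INT + REF + INT;
    }

    // Extra memory per additional node: two object headers (node and its
    // children array), the parent's reference to the node, int elementsSize,
    // reference to children, int childrenSize.
    static int perNodeOverhead() {
        return 2 * HEADER + REF + INT + REF + INT;
    }

    public static void main(String[] args) {
        System.out.println(smallDirOverhead()); // 40
        System.out.println(perNodeOverhead());  // 48
        // Each extra node holds at least degree - 1 = 1023 children,
        // so the amortized overhead per child is well under 0.05 bytes.
        System.out.println(perNodeOverhead() / 1023.0);
    }
}
```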

Lastly, the benefits of the B-Tree are significant and obvious, as described 
in the JIRA description; in the patch I also considered the memory overhead 
carefully while implementing the B-Tree. 

Please let me know if you have further questions. Thanks a lot.


was (Author: hitliuyi):
Thanks Jing for the further review. 
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)

I also think it's a critical part of the 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936183#comment-14936183
 ] 

Yi Liu edited comment on HDFS-9053 at 9/30/15 3:48 AM:
---

Thanks [~szetszwo] for the comments.

{quote}
here are quite a few errors in description and the analysis. Let's hold on the 
patch for the moment and fix all these errors first.
{quote}
Sure, let's fix the errors in the description and the analysis first; thanks 
for pointing this out.

{quote}
> So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 
> = 40 bytes on 64-bits system/JVM. (I have not counted object alignment)
It seems that there is at least one more reference not yet counted since Node 
is non-static.
{quote}
The reference to Node is already counted. Using a 64-bit system/JVM as an 
example, the first '8' is the reference to the B-Tree root, and the second '8' 
is the null children reference in BTree#Node; of course, we are talking about 
a directory which is not large and contains only one B-Tree node.


Nicholas, I am going to fix the two places in the JIRA description and the 
analysis in the JIRA comments that we have found so far:
# fix the time complexity of insert/delete for ArrayList
# fix the description of the memory usage of the B-Tree to add the system/JVM 
information and more details.

Have you found any other errors besides these? Please tell me, and I would 
appreciate it if you could also help edit them directly.


was (Author: hitliuyi):
Thanks [~szetszwo] for the comments.

{quote}
here are quite a few errors in description and the analysis. Let's hold on the 
patch for the moment and fix all these errors first.
{quote}
Sure, let's fix the errors in the description and the analysis first, thanks 
for pointing out this.

{quote}
> So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 
> = 40 bytes on 64-bits system/JVM. (I have not counted object alignment)
It seems that there is at least one more reference not yet counted since Node 
is non-static.
{quote}
The reference to Node is already counted, use 64-bits system/JVM as example, 
the first '8' is the reference to the B-Tree root.  The second '8' is the null 
reference for children in BTree#Node, of course we are talking about the 
directory which is not large and only contains one B-Tree node.


Nicholas, I am going to fix the two places in the JIRA description and analysis 
in JIRA comments, which we find so far
# fix the time complexity of insert/delete for ArrayList
# fix the description about memory usage for B-Tree to add the system/JVM 
information and add more detailed information.

Any other errors do you find besides these? please tell me and appreciate if 
you can also help to edit them directly.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936183#comment-14936183
 ] 

Yi Liu edited comment on HDFS-9053 at 9/30/15 4:04 AM:
---

Thanks [~szetszwo] for the comments.

{quote}
here are quite a few errors in description and the analysis. Let's hold on the 
patch for the moment and fix all these errors first.
{quote}
Sure, let's fix the errors in the description and the analysis first; thanks 
for pointing this out.

{quote}
> So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 
> = 40 bytes on 64-bits system/JVM. (I have not counted object alignment)
It seems that there is at least one more reference not yet counted since Node 
is non-static.
{quote}
I mean the *increased memory* compared with using ArrayList, as I stated in 
the comments above.
The reference to Node is already counted. Using a 64-bit system/JVM as an 
example, the first '8' is the reference to the B-Tree root, and the second '8' 
is the null children reference in BTree#Node.  This doesn't count the memory 
that ArrayList also has: 12 bytes object overhead + 4 bytes int modCount + 8 
bytes reference to elements + 4 bytes int elementsSize + the elements array 
memory.



Nicholas, I am going to fix the two places in the JIRA description and the 
analysis in the JIRA comments that we have found so far:
# fix the time complexity of insert/delete for ArrayList
# fix the description of the memory usage of the B-Tree to add the system/JVM 
information and more details.

Have you found any other errors besides these? Please tell me, and I would 
appreciate it if you could also help edit them directly.


was (Author: hitliuyi):
Thanks [~szetszwo] for the comments.

{quote}
here are quite a few errors in description and the analysis. Let's hold on the 
patch for the moment and fix all these errors first.
{quote}
Sure, let's fix the errors in the description and the analysis first, thanks 
for pointing out this.

{quote}
> So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 
> = 40 bytes on 64-bits system/JVM. (I have not counted object alignment)
It seems that there is at least one more reference not yet counted since Node 
is non-static.
{quote}
I mean the *increased memory* compared with using ArrayList, and I have stated 
this in above comments.
The reference to Node is already counted, use 64-bits system/JVM as example, 
the first '8' is the reference to the B-Tree root.  The second '8' is the null 
reference for children in BTree#Node.  This doesn't count the memory parts 
which ArrayList also has: 12 bytes object overhead + 4 bytes int modCount + 8 
bytes reference to elements + 4 bytes int elementsSize.



Nicholas, I am going to fix the two places in the JIRA description and analysis 
in JIRA comments, which we find so far
# fix the time complexity of insert/delete for ArrayList
# fix the description about memory usage for B-Tree to add the system/JVM 
information and add more detailed information.

Any other errors do you find besides these? please tell me and appreciate if 
you can also help to edit them directly.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-29 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936183#comment-14936183
 ] 

Yi Liu edited comment on HDFS-9053 at 9/30/15 3:55 AM:
---

Thanks [~szetszwo] for the comments.

{quote}
here are quite a few errors in description and the analysis. Let's hold on the 
patch for the moment and fix all these errors first.
{quote}
Sure, let's fix the errors in the description and the analysis first; thanks 
for pointing this out.

{quote}
> So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 
> = 40 bytes on 64-bits system/JVM. (I have not counted object alignment)
It seems that there is at least one more reference not yet counted since Node 
is non-static.
{quote}
I mean the *increased memory* compared with using ArrayList, as I stated in 
the comments above.
The reference to Node is already counted. Using a 64-bit system/JVM as an 
example, the first '8' is the reference to the B-Tree root, and the second '8' 
is the null children reference in BTree#Node.  This doesn't count the memory 
that ArrayList also has: 12 bytes object overhead + 4 bytes int modCount + 8 
bytes reference to elements + 4 bytes int elementsSize.



Nicholas, I am going to fix the two places in the JIRA description and the 
analysis in the JIRA comments that we have found so far:
# fix the time complexity of insert/delete for ArrayList
# fix the description of the memory usage of the B-Tree to add the system/JVM 
information and more details.

Have you found any other errors besides these? Please tell me, and I would 
appreciate it if you could also help edit them directly.


was (Author: hitliuyi):
Thanks [~szetszwo] for the comments.

{quote}
here are quite a few errors in description and the analysis. Let's hold on the 
patch for the moment and fix all these errors first.
{quote}
Sure, let's fix the errors in the description and the analysis first, thanks 
for pointing out this.

{quote}
> So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 
> = 40 bytes on 64-bits system/JVM. (I have not counted object alignment)
It seems that there is at least one more reference not yet counted since Node 
is non-static.
{quote}
The reference to Node is already counted, use 64-bits system/JVM as example, 
the first '8' is the reference to the B-Tree root.  The second '8' is the null 
reference for children in BTree#Node, of course we are talking about the 
directory which is not large and only contains one B-Tree node.


Nicholas, I am going to fix the two places in the JIRA description and analysis 
in JIRA comments, which we find so far
# fix the time complexity of insert/delete for ArrayList
# fix the description about memory usage for B-Tree to add the system/JVM 
information and add more detailed information.

Any other errors do you find besides these? please tell me and appreciate if 
you can also help to edit them directly.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete, the time complexity is 
> O\(n), (the search is O(log n), but insertion/deleting causes re-allocations 
> and copies of arrays), for large directory, the operations are expensive.  If 
> the children grow to 1M size, the ArrayList will resize to > 1M capacity, so 
> need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) 
> continuous heap memory, it easily causes full GC in HDFS cluster where 
> namenode heap memory is already highly used.  I recap the 3 main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-28 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934634#comment-14934634
 ] 

Yi Liu edited comment on HDFS-9053 at 9/29/15 5:15 AM:
---

Thanks for the comments, Nicholas.

{quote}
Where are the numbers, especially the 4s, from? Do we assume a 32-bit world?
{quote}

{code}
public class BTree<E> implements Iterable<E> {
  ...
  private final int degree;
  private Node root;
  private int size;
  private transient int modCount = 0;
  ...
}

private final class Node {
  static final int DEFAULT_CAPACITY = 5;
  private Object[] elements;
  private int elementsSize;
  private Object[] children;
  private int childrenSize;
  ...
}
{code}
Sorry, I should have used a 64-bit system/JVM in the description. The details 
are:

Compared to ArrayList, we add the following:
private final int degree;    <-  4 bytes int
private Node root;           <-  reference, 4 bytes on a 32-bit system/JVM, 8 
bytes on a 64-bit system/JVM
private int size;            <-  4 bytes int

{{Node}} object overhead     <-  12 bytes
private Object[] children;   <-  null reference, 4 bytes on a 32-bit 
system/JVM, 8 bytes on a 64-bit system/JVM
private int childrenSize;    <-  4 bytes int

So in total: 12+4+4+4+4+4 = 32 bytes on a 32-bit system/JVM, and 12+4+8+4+8+4 
= 40 bytes on a 64-bit system/JVM. (I have not counted object alignment.)
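The two totals can be checked mechanically. This sketch just sums the assumed field sizes from the analysis above (12-byte object header, 4-byte int, and a reference size that depends on the JVM); it is an arithmetic check, not a measurement:

```java
public class FieldSums {
    // Sum the increased fields for a given reference size:
    // Node object header + int degree + root reference + int size
    // + null children reference + int childrenSize.
    static int overhead(int refSize) {
        int header = 12, intSize = 4;
        return header + intSize + refSize + intSize + refSize + intSize;
    }

    public static void main(String[] args) {
        System.out.println(overhead(4)); // 32 on a 32-bit system/JVM
        System.out.println(overhead(8)); // 40 on a 64-bit system/JVM
    }
}
```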



was (Author: hitliuyi):
Thanks for the comments, Nicholas.

{quote}
Where are the numbers, especially the 4s, from? Do we assume a 32-bit world?
{quote}

{code}
public class BTree<E> implements Iterable<E> {
  ...
  private final int degree;
  private Node root;
  private int size;
  private transient int modCount = 0;
  ...
}

private final class Node {
  static final int DEFAULT_CAPACITY = 5;
  private Object[] elements;
  private int elementsSize;
  private Object[] children;
  private int childrenSize;
  ...
}
{code}
Sorry, I should use 64-bits system/JVM, and details are:

Compared to ArrayList, we increases following things:
private final int degree;   <-   4 bytes Integer
private Node root;   <-  reference, 4 bytes on 32-bits 
system/JVM, 8 bytes on 64-bits system/JVM
private int size;<-  4 bytes Integer

{{Node}} object overhead   <--  12 bytes
private Object[] children; <-  null reference, 4 bytes on 
32-bits system/JVM, 8 bytes on 64-bits system/JVM
private int childrenSize;   <-  4 bytes Integer.

So totally  12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 = 
40 bytes on 64-bits system/JVM. (I have not counted object alignment)


> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete/search, the time 
> complexity is O(log n), but insertion/deleting causes re-allocations and 
> copies of big arrays, so the operations are costly.  For example, if the 
> children grow to 1M size, the ArrayList will resize to > 1M capacity, so need 
> > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS 
> cluster where namenode heap memory is already highly used.  I recap the 3 
> main issues:
> # Insertion/deletion operations in large directories are expensive because 
> re-allocations and copies of big arrays.
> # Dynamically allocate several MB continuous heap memory which will be 
> long-lived can easily cause full GC problem.
> # Even most children are removed later, but the directory INode still 
> occupies same size heap memory, since the ArrayList will never shrink.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to 
> solve the problem suggested by [~shv]. 
> So the target of this JIRA is to implement a low memory footprint B-Tree and 
> use it to replace ArrayList. 
> If the elements size is not large (less than the maximum degree of B-Tree 
> node), the B-Tree only has one root node which contains an array for the 
> elements. And if the size grows large enough, it will split automatically, 
> and if elements are removed, then B-Tree nodes can 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-28 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934598#comment-14934598
 ] 

Yi Liu edited comment on HDFS-9053 at 9/29/15 4:48 AM:
---

Thanks Jing for the further review. 
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)

I also think it's a critical part of the code, thanks a lot for the review, 
[~szetszwo]! also welcome [~kihwal] and anyone who can review this.

{quote}
For insert/delete, the running time is O( n) for ArrayList.
{quote}
Yeah, if we count the re-allocations and copies of big arrays, I just mean the 
search time complexity before we do insert/delete.

{quote}
How about memory usage? One reason to use ArrayList instead of a more 
complicated data structure is to save memory. It seems to me that using B-Tree 
increases memory usage in general. It increases memory usage dramatically in 
some worst cases such as the average branching factor of the tree is small 
(this may also be a common case).
{quote}
That's a great question.  I gave a brief description in my second comment 
above: 
[here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814],
 and here is a more detailed explanation:
{quote}
in some worst cases such as the average branching factor of the tree is small
{quote}
Don't worry about this, for any B-Tree node except root, it contains at least 
min degree size of elements, otherwise there is merging or element shift to 
guarantee this property. Besides, we expand the elements in B-Tree node the 
same as ArrayList.

Generally speaking, the B-Tree in the patch increases memory usage *very 
little*, and the increase can be ignored:
# *For normal and small size directories*, let's say it's (INode) children size 
is not large and then B-Tree only contains one node. Currently the default 
degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047, 
that means one B-Tree node can at most contain 2047 elements. And the children 
of the B-Tree root node is null, so the increased memory usage compared to 
ArrayList is: one object overhead + few variables, total increase about 
(12+4+4+4+4+4) = 32 bytes.
# *For large directories*, let's say it's (INode) children size is larger than 
the max degree and then B-Tree contains more than one nodes. For any B-Tree 
node except root, it contains at least min degree size elements, in our case, 
it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so 
it's the same as ArrayList at this point, typically the worst case for B-Tree 
from memory point view is all elements are in-order, in this case, when we 
split a B-Tree node, we allocate the elements size of memory and do expanding, 
but no more elements are added later in the split left B-Tree node, and there 
is a waste of 1/3 size of elements array memory, but actually it's almost the 
same as the worst cases even for ArrayList if there is no/few elements added 
after expanding, but in practice, the elements are not in order.  Now back to 
the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: 
one object overhead + one element pointer in the father + 2 variables, so total 
increase about (12 + 4+4+4) = 24 bytes for every new added B-Tree node, but one 
B-Tree node contains at least (degree - 1) elements, and at most (2* degree - 
1) elements.  Another small increased memory is the object overhead of the 
children in the B-Tree inner node.

In conclusion, the increased memory used for B-Tree compared to ArrayList is 
very small, for small/normal directories, it's only 32 bytes overhead which is 
even less than memory of one block in NN, for large directories, besides 
additional increased 12 bytes memory for the variables in B-Tree , it will 
increase about 24 bytes memory for every new added B-Tree node which can 
contain 1023 ~ 2047 INodes in the directory in our case.  We can ignore these 
few memory overhead for a directory. 

Lastly, the benefits of the B-Tree are significant and obvious, as described 
in the JIRA description; in the patch I also considered the memory overhead 
carefully while implementing the B-Tree. 

Please let me know if you have further questions. Thanks a lot.


was (Author: hitliuyi):
Thanks Jing for the further review. 
{quote}
Yeah, sorry for the delay.
{quote}
NP, thanks for the help :)

I also think it's a critical part of the code, thanks a lot for the review, 
[~szetszwo]! also welcome [~kihwal] and anyone who can review this.

{quote}
For insert/delete, the running time is O( n) for ArrayList.
{quote}
Yeah, if we count the re-allocations and copies of big arrays, I just mean the 
search time complexity before we do insert/delete.

{quote}
How about memory usage? One reason to use ArrayList instead of a more 
complicated data structure is to save memory. It seems to me that 

[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-22 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903725#comment-14903725
 ] 

Yi Liu edited comment on HDFS-9053 at 9/23/15 12:40 AM:


Jenkins has some issues and the test failures are not related. 
The reason is that multiple Maven invocations run at once on the same machine 
sharing the same .m2 directory. This patch adds a new class file in 
hadoop-common; when another Maven invocation runs after this one on the same 
machine, its "mvn install" replaces the hadoop-common jar in the .m2 
directory, so the test failures in Jenkins show NoClassDefFoundError for the 
newly added class in hadoop-common.

I will re-trigger Jenkins some other time, when there are few Jenkins jobs in 
the queue.


was (Author: hitliuyi):
The Jenkins has some issue and the test failures are not related. 
The reason is: having multiple maven invocations going on at once sharing the 
same .m2 directory on the same machine, this patch includes new class file in 
Hadoop-common, when some other maven invocations run after this one on the same 
machine,  "mvn install" of test will replace the hadoop-common jar in .m2 
directory, so the test failures in Jenkins show NoClassDefFoundError for the 
new added class in hadoop-common.

> Support large directories efficiently using B-Tree
> --
>
> Key: HDFS-9053
> URL: https://issues.apache.org/jira/browse/HDFS-9053
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode
>Reporter: Yi Liu
>Assignee: Yi Liu
>Priority: Critical
> Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 
> (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch
>
>
> This is a long standing issue, we were trying to improve this in the past.  
> Currently we use an ArrayList for the children under a directory, and the 
> children are ordered in the list, for insert/delete/search, the time 
> complexity is O(log n), but insertion/deleting causes re-allocations and 
> copies of big arrays, so the operations are costly.  For example, if the 
> children grow to 1M size, the ArrayList will resize to > 1M capacity, so need 
> > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS 
> cluster where namenode heap memory is already highly used.  I recap the 3 
> main issues:
> # Insertion/deletion operations in large directories are expensive because 
> of re-allocations and copies of big arrays.
> # Dynamically allocating several MB of contiguous, long-lived heap memory 
> can easily cause full-GC problems.
> # Even if most children are removed later, the directory INode still 
> occupies the same amount of heap memory, since the ArrayList never shrinks.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but uses a B-Tree 
> to solve the problem, as suggested by [~shv]. 
> So the target of this JIRA is to implement a low-memory-footprint B-Tree 
> and use it to replace the ArrayList. 
> If the number of elements is small (less than the maximum degree of a 
> B-Tree node), the B-Tree has only a root node, which contains an array for 
> the elements. If the size grows large enough, the node splits 
> automatically, and if elements are removed, B-Tree nodes can merge 
> automatically (see more: https://en.wikipedia.org/wiki/B-tree).  This 
> solves the above 3 issues.
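
The single-root-node-then-split behavior described above can be sketched with a 
minimal B-Tree. This is a hypothetical Java illustration of the technique 
(CLRS-style preemptive split, integer keys), not the code from the attached 
patches:

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal B-Tree sketch with minimum degree T: each non-root node holds
 *  T-1..2T-1 keys. While small, the tree is just one root node holding a
 *  sorted array; it only grows node structure once that array fills up. */
public class BTreeSketch {
    static final int T = 3;

    static class Node {
        List<Integer> keys = new ArrayList<>();
        List<Node> children = new ArrayList<>();
        boolean leaf = true;
    }

    Node root = new Node();

    void insert(int key) {
        if (root.keys.size() == 2 * T - 1) {   // root full: grow in height
            Node newRoot = new Node();
            newRoot.leaf = false;
            newRoot.children.add(root);
            splitChild(newRoot, 0);
            root = newRoot;
        }
        insertNonFull(root, key);
    }

    /** Split the full i-th child of parent, pushing its median key up. */
    private void splitChild(Node parent, int i) {
        Node full = parent.children.get(i);
        Node right = new Node();
        right.leaf = full.leaf;
        right.keys.addAll(full.keys.subList(T, 2 * T - 1));  // upper T-1 keys
        int median = full.keys.get(T - 1);
        if (!full.leaf) {
            right.children.addAll(full.children.subList(T, 2 * T));
            full.children.subList(T, 2 * T).clear();
        }
        full.keys.subList(T - 1, 2 * T - 1).clear();  // keep lower T-1 keys
        parent.keys.add(i, median);
        parent.children.add(i + 1, right);
    }

    private void insertNonFull(Node n, int key) {
        int i = n.keys.size() - 1;
        while (i >= 0 && key < n.keys.get(i)) i--;
        if (n.leaf) {
            n.keys.add(i + 1, key);               // small sorted-array insert
        } else {
            i++;
            if (n.children.get(i).keys.size() == 2 * T - 1) {
                splitChild(n, i);
                if (key > n.keys.get(i)) i++;
            }
            insertNonFull(n.children.get(i), key);
        }
    }

    static int count(Node n) {
        int c = n.keys.size();
        for (Node ch : n.children) c += count(ch);
        return c;
    }

    public static void main(String[] args) {
        BTreeSketch t = new BTreeSketch();
        for (int k = 0; k < 100; k++) t.insert(k);
        // after enough inserts the root has split and is no longer a leaf
        System.out.println("total keys: " + count(t.root));  // total keys: 100
    }
}
```

Note how every insert touches only small per-node arrays, so there is never a 
single multi-MB contiguous allocation, and removals can merge nodes back so 
memory is returned.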



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree

2015-09-14 Thread Yi Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740842#comment-14740842
 ] 

Yi Liu edited comment on HDFS-9053 at 9/15/15 12:35 AM:


To integrate the B-Tree with INodeDirectory: the directory currently exposes 
{{getChildrenList}} and returns {{ReadOnlyList}}, but the B-Tree will be 
updated to implement Collection (currently it implements Iterable), since it 
is not a list. Most other places just iterate over all elements, and some 
places iterate starting from a certain child. So my plan is to add 
{{ReadOnlyCollection}}, have {{ReadOnlyList}} extend it, and have getChildren 
return {{ReadOnlyCollection}}. {{ReadOnlyCollection}} has an interface that 
allows creating an Iterator starting from a certain child. For snapshots, it 
makes sense to keep the current behavior and still use a list to keep the 
CREATED/DELETED diff.
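
The proposed interface could look roughly like the following. This is a 
hypothetical sketch following the names in the comment (the toy backing store 
uses a TreeSet; the real patch would back it with the B-Tree):

```java
import java.util.Iterator;
import java.util.TreeSet;

public class ReadOnlyCollectionSketch {
    /** Read-only view that can also iterate starting from a given element,
     *  e.g. to resume a partial directory listing at a start-after child. */
    interface ReadOnlyCollection<E> extends Iterable<E> {
        int size();
        Iterator<E> iteratorFrom(E start);
    }

    /** Toy sorted backing store standing in for the B-Tree. */
    static class SortedChildren implements ReadOnlyCollection<String> {
        private final TreeSet<String> set = new TreeSet<>();
        void add(String s) { set.add(s); }
        @Override public int size() { return set.size(); }
        @Override public Iterator<String> iterator() { return set.iterator(); }
        @Override public Iterator<String> iteratorFrom(String start) {
            // inclusive tail view: all children >= start, in sorted order
            return set.tailSet(start, true).iterator();
        }
    }

    public static void main(String[] args) {
        SortedChildren c = new SortedChildren();
        for (String s : new String[] {"a", "b", "c", "d"}) c.add(s);
        StringBuilder sb = new StringBuilder();
        for (Iterator<String> it = c.iteratorFrom("c"); it.hasNext(); ) {
            sb.append(it.next());
        }
        System.out.println(sb);   // cd
    }
}
```

Callers that iterate everything keep using the plain for-each loop; only the 
partial-listing paths need {{iteratorFrom}}.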


was (Author: hitliuyi):
To integrate the B-Tree with INodeDirectory: the directory exposes 
{{getChildrenList}} and returns {{ReadOnlyList}}, but the B-Tree implements 
Iterable; it is not a list. Most other places just iterate over all elements, 
and some places iterate starting from a certain child. So my plan is to add a 
new Iterable interface which extends {{java.lang.Iterable}} and allows creating 
an iterator starting from a certain child. Then in the directory, getChildren 
returns {{ReadOnlyIterator}}, and {{ReadOnlyList}} implements the 
ReadOnlyIterator. For snapshots, it makes sense to keep the current behavior 
and still use the {{ReadOnlyList}}.
