[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15282164#comment-15282164 ] Sangjin Lee commented on HDFS-9053: --- There has been no movement on this for a while. We'll need to move it out of scope for 2.8.0 soon. Let me know if you disagree. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969206#comment-14969206 ] Yi Liu commented on HDFS-9053: -- {quote} The advantages of having one array are 1) smaller memory footprint and 2) not requiring to maintain the invariant above. {quote} Agree, my concern is it requires one array contains two different data types and more complex code logic. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969202#comment-14969202 ] Yi Liu commented on HDFS-9053: -- {quote} Could you give some examples? {quote} https://github.com/google/btree The google btree implementation using Go language. https://code.google.com/p/cpp-btree/ The btree in cpp Also there are some other btree implementations in github which may be not popular. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969166#comment-14969166 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- > Actually most implementations I saw use two array. Could you give some examples? Two arrays are closely related to each other as shown in the javadoc below. {code} + * It must at all times maintain invariant that either: + * -childrenSize == 0, elementsSize unconstrained + * -childrenSize == elementsSize + 1 {code} The advantages of having one array are 1) smaller memory footprint and 2) not requiring to maintain the invariant above. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968754#comment-14968754 ] Yi Liu commented on HDFS-9053: -- Thanks for the discussion, Nicholas. {quote} BTW, why do we need two arrays for elements and children. I think BTree implementation naturally needs only one array. {quote} Actually most implementations I saw use two array. One array stores the elements, and another array stores the reference to children nodes (childrenSize == elementsSize+1 for non leaf nodes). Having them in one array makes: 1) the array contains two different types, I think it's strange. 2) more complicated logic and not easy to understand. {quote} When the #children is small, ArrayList is just fine. Let's use BTree only when #children is larger than a threshold {quote} Yeah, I ever did this as you suggested, please see {{005}} patch, I added a {{SortedCollection}} which wraps a custom array list implementation and the btree. - I didn't do as your suggestion of "The field in INodeDirectory is List children which may refer to either an ArrayList or a BTreeList. We may replace the list at runtime", is because: It's hard to use a List, since when we use ArrayList, we do searching before index/delete and access through index, but BTree is in-order and we don't access through index. I personally think this approach is a bit complex. Do you have further suggestion about it? Nicholas. On the other hand, back to the btree, Jing, I see you are not in favor of btree extend Node, is it possible to make some change and then you can accept? Any suggestions? Appreciate you two to spend time on the discussion, thanks a lot! > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968454#comment-14968454 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- When the #children is small, ArrayList is just fine. Let's use BTree only when #children is larger than a threshold, say 4k or 8k. After an ArrayList is converted to BTree, it may not need to be converted back to ArrayList for simplicity and also that the case is uncommon. BTW, why do we need two arrays for elements and children. I think BTree implementation naturally needs only one array. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968445#comment-14968445 ] Yi Liu commented on HDFS-9053: -- Got it now. Thanks a lot [~jingzhao]! Let's recalculate again. For original btree implementation, it increases 44 bytes which is my estimation and ignore alignment. Now I have looked into the real memory layout, it actually increases (40 + 40) - 40 arraylist = 40 bytes. And I can do small improvement to remove {{degree}} variable 4 bytes + 4 bytes alignment/padding gap = 8 bytes. So finally if we don't let BTree extend Node, it increases *32 bytes* for a directly. 32 bytes memory increment for a directory is fine for me and was my original thought. As in your example, if we have 1M directories, then we only increase heap size by 32 MB. I also respect Nicholas' comment, if we all think it's OK, I am happy to do this :). {noformat} org.apache.hadoop.util.BTree object internals: OFFSET SIZE TYPE DESCRIPTIONVALUE 016 (object header)N/A 16 4 int BTree.degree N/A 20 4 int BTree.size N/A 24 4 int BTree.modCount N/A 28 4 (alignment/padding gap)N/A 32 8 Node BTree.root N/A Instance size: 40 bytes (estimated, the sample instance is not available) Space losses: 4 bytes internal + 0 bytes external = 4 bytes total {noformat} {noformat} org.apache.hadoop.util.BTree.Node object internals: OFFSET SIZE TYPE DESCRIPTIONVALUE 016 (object header)N/A 16 4 int Node.elementsSize N/A 20 4 int Node.childrenSize N/A 24 8 Object[] Node.elements N/A 32 8 Object[] Node.children N/A Instance size: 40 bytes (estimated, the sample instance is not available) Space losses: 0 bytes internal + 0 bytes external = 0 bytes total {noformat} > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968424#comment-14968424 ] Jing Zhao commented on HDFS-9053: - I understand the latest patch only increases 8 bytes for each directory. But I'm not in favor of the latest change letting BTree extend Node. So I think we may first want to understand the total number of memory increased by the original patch. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968277#comment-14968277 ] Yi Liu commented on HDFS-9053: -- Thanks [~jingzhao]! {quote} My concern is if the number of directories is limited, then maybe increasing 44 bytes per directory is fine. E.g., for a big cluster with >100M files/directories, if we only have 1M directories, then we only increase the heap size by 44 MB. Compared with the total heap size this may be just <0.1% increase. {quote} Each directory increasing 44 bytes is the old implementation. In the latest patch, each directory only increase *8* bytes. So it's quite small for directory. I know some users have more than 500M files, and some directory contains quite lots of files, usually this kind of large directory is accessed most frequently, just as "barrel principle", this becomes the bottleneck of NN. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968202#comment-14968202 ] Jing Zhao commented on HDFS-9053: - Thanks for all the work here, [~hitliuyi]! Before we decide what optimization should be done, maybe we can first check the typical percentage of INodeDirectories (among all the INodes) in a cluster. My concern is if the number of directories is limited, then maybe increasing 44 bytes per directory is fine. E.g., for a big cluster with >100M files/directories, if we only have 1M directories, then we only increase the heap size by 44 MB. Compared with the total heap size this may be just <0.1% increase. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964775#comment-14964775 ] Yi Liu commented on HDFS-9053: -- The test failures are not related. I plan to support {{shrinkable}} in future task. Hi [~jingzhao], could you help to review the new patch again? (The diff is small compared with previous patch you reviewed) [~szetszwo], I can change the default degree to 2K if you like. BTW, I have checked the logic of btree many times and done a lot of tests, and I am confident of the correctness of btree. Thanks a lot. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964679#comment-14964679 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 23m 13s | Pre-patch trunk has 1 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 8 new or modified test files. | | {color:green}+1{color} | javac | 8m 40s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 28s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 28s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 9s | The applied patch generated 7 new checkstyle issues (total was 0, now 7). | | {color:green}+1{color} | whitespace | 0m 11s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 2m 5s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 39s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 5m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | common tests | 8m 44s | Tests failed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 52m 27s | Tests failed in hadoop-hdfs. | | | | 115m 59s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.ha.TestZKFailoverController | | | hadoop.ipc.TestIPC | | | hadoop.fs.viewfs.TestViewFsAtHdfsRoot | | | hadoop.hdfs.TestLeaseRecovery2 | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12767526/HDFS-9053.007.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 9cb5d35 | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/13075/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/13075/artifact/patchprocess/diffcheckstylehadoop-common.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13075/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/13075/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/13075/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf901.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/13075/console | This message was automatically generated. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprin
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14964433#comment-14964433 ] Yi Liu commented on HDFS-9053: -- Update the patch: # Fix some checkstyles and the whitespace. # Update few method names in btree to be more readable. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959988#comment-14959988 ] Yi Liu commented on HDFS-9053: -- Hi Nicholas, sorry, I may bypass some description here. For 2047, I wanted to say it just an example threshold of small elements size. Currently the small elements size is an assumption value, actually we can set the degree of BTree as any value we want to. If we want 4K as the threshold of small elements size, we can set the degree of B-Tree to 2K, then max degree is (4K - 1). (I should make the description clear..) Thanks. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959972#comment-14959972 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- > For small elements size (assume # < max degree which is 2047), ... Do I miss > something? According to [your comment|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14950498&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14950498] (also copied below), you were saying that B-Tree only increased 8 bytes when #children < 4K, i.e. when 2047 < #children < 4K. Is it still true? If not, how much memory is needed when 2047 < #children < 4K? {quote} I find a good approach to improve B-Tree memory overhead to make it only increase 8 bytes memory usage comparing with using ArrayList for small elements size. So we don't need to use ArrayList when #children is small (< 4K), and we can always use the BTree. {quote} > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14959955#comment-14959955 ] Yi Liu commented on HDFS-9053: -- Thanks [~szetszwo] for the comments. {quote} >> 24 8 Object[] Node.elements N/A >> 32 8 Object[] Node.children N/A It only counts the reference but array objects are not counted. So the BTree overhead is still a lot more than ArrayList. {quote} For small elements size (assume # < max degree which is 2047), the {{children}} is null reference, so there is no array object of {{children}} here, and for {{elements}}, {{ArrayList}} also has it. So as described above, {{BTree}} increases 8 bytes compared with {{ArrayList}} for small size elements. Do I miss something? > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14958688#comment-14958688 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- {noformat} 24 8 Object[] Node.elements N/A 32 8 Object[] Node.children N/A {noformat} It only counts the reference but array objects are not counted. So the BTree overhead is still a lot more than ArrayList. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951674#comment-14951674 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 35s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 8 new or modified test files. | | {color:green}+1{color} | javac | 7m 55s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 33s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 20s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 13s | The applied patch generated 8 new checkstyle issues (total was 0, now 8). | | {color:red}-1{color} | whitespace | 0m 10s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 30s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 7m 7s | Tests passed in hadoop-common. | | {color:green}+1{color} | hdfs tests | 190m 7s | Tests passed in hadoop-hdfs. | | | | 241m 41s | | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12765801/HDFS-9053.006.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / def374e | | Release Audit | https://builds.apache.org/job/PreCommit-HDFS-Build/12911/artifact/patchprocess/patchReleaseAuditProblems.txt | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/12911/artifact/patchprocess/diffcheckstylehadoop-common.txt | | whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/12911/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12911/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12911/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/12911/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf902.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/12911/console | This message was automatically generated. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14951435#comment-14951435 ] Yi Liu commented on HDFS-9053: -- The Jenkins conflicts with other builds, let me re-trigger it later. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950856#comment-14950856 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 22m 41s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 8 new or modified test files. | | {color:green}+1{color} | javac | 8m 55s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 11m 28s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 22s | The applied patch generated 1 release audit warnings. | | {color:red}-1{color} | checkstyle | 2m 6s | The applied patch generated 8 new checkstyle issues (total was 0, now 8). | | {color:red}-1{color} | whitespace | 0m 10s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 47s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 38s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 47s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 7m 48s | Tests passed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 120m 50s | Tests failed in hadoop-hdfs. | | | | 181m 56s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.TestWriteRead | | | hadoop.hdfs.server.namenode.ha.TestDNFencing | | | hadoop.hdfs.TestHFlush | | | hadoop.security.TestPermission | | | hadoop.hdfs.TestParallelRead | | | hadoop.fs.viewfs.TestViewFsHdfs | | | hadoop.hdfs.TestBlockReaderLocalLegacy | | | hadoop.hdfs.server.namenode.TestXAttrConfigFlag | | | hadoop.hdfs.TestPread | | | hadoop.hdfs.TestMiniDFSCluster | | | hadoop.hdfs.TestDFSStripedOutputStream | | | hadoop.hdfs.TestWriteConfigurationToDFS | | | hadoop.hdfs.TestDatanodeRegistration | | | hadoop.hdfs.web.TestWebHDFSXAttr | | | hadoop.hdfs.TestDFSRollback | | | hadoop.hdfs.TestDataTransferKeepalive | | | hadoop.hdfs.server.namenode.ha.TestDelegationTokensWithHA | | | hadoop.hdfs.TestDatanodeConfig | | | hadoop.hdfs.TestDFSFinalize | | | hadoop.hdfs.server.namenode.ha.TestGetGroupsWithHA | | | hadoop.fs.TestWebHdfsFileContextMainOperations | | | hadoop.fs.TestGlobPaths | | | hadoop.hdfs.TestDFSShell | | | hadoop.hdfs.tools.TestDFSHAAdminMiniCluster | | | hadoop.hdfs.TestSeekBug | | | hadoop.fs.loadGenerator.TestLoadGenerator | | | hadoop.hdfs.TestCrcCorruption | | | hadoop.fs.contract.hdfs.TestHDFSContractMkdir | | | hadoop.hdfs.TestAbandonBlock | | | hadoop.hdfs.TestGetFileChecksum | | | hadoop.hdfs.TestSafeModeWithStripedFile | | | hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewerForContentSummary | | | hadoop.hdfs.TestFileCreationDelete | | | hadoop.hdfs.TestReadWhileWriting | | | hadoop.fs.viewfs.TestViewFileSystemWithAcls | | | hadoop.fs.contract.hdfs.TestHDFSContractConcat | | | hadoop.hdfs.TestDFSStripedOutputStreamWithFailure010 | | | hadoop.hdfs.util.TestDiff | | | hadoop.hdfs.security.TestDelegationToken | | | hadoop.fs.TestSymlinkHdfsDisable | | | hadoop.fs.contract.hdfs.TestHDFSContractRootDirectory | | | hadoop.hdfs.TestMissingBlocksAlert | | | hadoop.hdfs.TestBlocksScheduledCounter | | | hadoop.hdfs.TestSmallBlock | | | hadoop.cli.TestDeleteCLI | | | hadoop.hdfs.TestDFSClientRetries | | | hadoop.fs.viewfs.TestViewFsWithXAttrs | | | hadoop.hdfs.TestDFSMkdirs | | | hadoop.hdfs.tools.TestDFSAdmin | | | hadoop.hdfs.server.namenode.TestFavoredNodesEndToEnd | | | hadoop.hdfs.server.namenode.TestNameNodeMXBean | | | hadoop.hdfs.web.TestWebHDFSForHA | | | hadoop.fs.viewfs.TestViewFsDefaultValue | | | hadoop.fs.contract.hdfs.TestHDFSContractOpen | | | hadoop.hdfs.server.namenode.TestDeleteRace | | | hadoop.hdfs.TestFSInputChecker | | | hadoop.hdfs.web.TestWebHdfsWithAuthenticationFilter | | | hadoop.fs.contract.hdfs.TestHDFSContractRename | | | hadoop.hdfs.server.namenode.ha.TestXAttrsWithHA | | | hadoop.hdfs.server.namenode.ha.TestHAMetrics | | | hadoop.cli.TestErasureCodingCLI | | | hadoop.hdfs.TestRollingUpgradeRollback | | | hadoop.hdfs.TestRemoteBlockReader | | | hadoop.hdfs.server.namenode.TestNameNodeResourceChecker | | | hadoop.hdfs.tools.TestDFSAdminWithHA | | | hadoop.hdfs.TestBlockStoragePolicy | | | hadoop.hdfs.TestLeaseRecovery | | | hadoop.fs.viewfs.TestViewFsAtHdfsRoot | | | hadoop.hdfs.TestBlockReaderLocal | | | hadoop.fs.contract.hdfs.TestHDFSContractGetFileStatus | | | hadoop.cli.TestCryptoAdminCLI | | | hadoop.h
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950498#comment-14950498 ] Yi Liu commented on HDFS-9053: -- I find a good approach to improve B-Tree overhead to make it only increase *8 bytes* memory usage comparing with using {{ArrayList}} for small elements size. So we don't need to use ArrayList when #children is small (< 4K), and we can always use the {{BTree}}. The main idea is to let {{BTree}} extend the BTree Node, then we don't need a separate root node, since {{BTree}} itself is the root. {noformat} java.util.ArrayList object internals: OFFSET SIZE TYPE DESCRIPTIONVALUE 016 (object header)N/A 16 4 int AbstractList.modCount N/A 20 4 (alignment/padding gap)N/A 24 4 int ArrayList.size N/A 28 4 (alignment/padding gap)N/A 32 8 Object[] ArrayList.elementData N/A Instance size: 40 bytes (estimated, the sample instance is not available) {noformat} {noformat} org.apache.hadoop.util.btree.BTree object internals: OFFSET SIZE TYPE DESCRIPTIONVALUE 016 (object header)N/A 16 4 int Node.elementsSize N/A 20 4 int Node.childrenSize N/A 24 8 Object[] Node.elements N/A 32 8 Object[] Node.children N/A 40 4 int BTree.size N/A 44 4 int BTree.modCount N/A Instance size: 48 bytes (estimated, the sample instance is not available) {noformat} We can see {{BTree}} only increases *8 bytes* comparing with {{ArrayList}} for a {{INodeDirectory}}. [~jingzhao], [~szetszwo], please look at the new patch {{006}}. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950084#comment-14950084 ] Yi Liu commented on HDFS-9053: -- The increased memory usage is only 8 bytes in {{SortedCollection}} comparing with {{ArrayList}} if the elements size is small (<= 4K). {noformat} java.util.ArrayList object internals: OFFSET SIZE TYPE DESCRIPTIONVALUE 016 (object header)N/A 16 4 int AbstractList.modCount N/A 20 4 (alignment/padding gap)N/A 24 4 int ArrayList.size N/A 28 4 (alignment/padding gap)N/A 32 8 Object[] ArrayList.elementData N/A Instance size: 40 bytes (estimated, the sample instance is not available) {noformat} {noformat} org.apache.hadoop.hdfs.util.SortedCollection object internals: OFFSET SIZE TYPE DESCRIPTIONVALUE 016 (object header)N/A 16 4 int SortedCollection.initCapacity N/A 20 4 int SortedCollection.size N/A 24 4 int SortedCollection.modCount N/A 28 4 int SortedCollection.degreeN/A 32 8 Object[] SortedCollection.elements N/A 40 8BTree SortedCollection.btree N/A Instance size: 48 bytes (estimated, the sample instance is not available) {noformat} Hi [~jingzhao], [~szetszwo], could you review the new patch? Thanks. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950082#comment-14950082 ] Yi Liu commented on HDFS-9053: -- Update the patch. The new patch uses an array list if the children size is small (<= 4K), otherwise use B-Tree. The new patch includes following changes: # add {{SortedCollection}} which uses an array list to store elements if the size is small, otherwise use B-Tree. It implements a shrinkable array list and control expanding. The merits comparing with using java ArrayList in it are: (1) Fewer memory: save the object overhead/alignment of ArrayList. (2) The max capacity is 4K, so no need to expand to capacity larger than 4K (3) Shrinkable: if the elements size becomes few, the internal array will shrink. # Add more long tests for the {{B-Tree}} and {{SortedCollection}}. I am still running the long running tests locally, they all success so far. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949847#comment-14949847 ] Yi Liu commented on HDFS-9053: -- Hi [~szetszwo], thanks for your comments. I will do as your suggestion: use an array list if the children size is small (<= 4K), otherwise use B-Tree. But there may be some differences for the implementation details. {quote} The field in INodeDirectory is List children which may refer to either an ArrayList or a BTreeList. We may replace the list at runtime {quote} It's hard to use a List, since when we use ArrayList, we do searching before index/delete and access through index, but BTree is in-order and we don't access through index. So I think we do it in following way: # actually I made an initial patch of switching between array list and b-tree few days ago. The logic of ArrayList is not complicated, so I implement it in a new class and support shrinkable, in the new class, I control the array and expanding, also the new class keeps reference to a b-tree, if the elements size becomes large, it switches to use the b-tree. The new class is an in-order data structure, not like ArrayList which we need to search before operating. In INodeDirectory, we just need to use the new class, another reason of I implement a new data structure class and don't do the switching in INodeDirectory is: we should have a {{SWITCH_THRESHOLD}} for switching from array list to b-tree, and need a low water mark to switch back, they should not be the same value, otherwise, the switching becomes frequent at some point, so I don't want to expose too many internal logic of switching in INodeDirectly. But the final memory usage of the new data structure is the same as ArrayList, even it has a reference to b-tree, and it supports Shrinkable I will give the memory usage after posting a new patch. {quote} I am also worry about the potiental bugs and the risk. If there is a bug in B-Tree, it is possible to lose one or more sub trees and, as a result, lose a lot of data. ArrayList is already well-tested. Anyway, we need more tests for the B-Tree, especially some long running random tests. {quote} Sure, I will add more tests for it, I have added many tests including some long running. I agree with that a bug-free data structure implementation is not easy, we should be careful and test the new implementations extensively :) > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14949749#comment-14949749 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- Hi [~hitliuyi], 44 bytes overhead is not small especially when most of the directories fall in this case. Just like that we did very hard but only able to save 40 bytes per replica in HDFS-8859. I agree that the current BTree implementation has an advantage that it can shrink. We may also implement our a shrinkable array list. The shrinkable feature does not seem a big deal since this is not a common case. > ... we may need write a class to wrap the two data structures, ... Wrapping is not needed. The field in INodeDirectory is List children which may refer to either an ArrayList or a BTreeList. We may replace the list at runtime. For the new B-Tree implementation. I am also worry about the potiental bugs and the risk. If there is a bug in B-Tree, it is possible to lose one or more sub trees and, as a result, lose a lot of data. ArrayList is already well-tested. Anyway, we need more tests for the B-Tree, especially some long running random tests. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14948629#comment-14948629 ] Yi Liu commented on HDFS-9053: -- Hi [~szetszwo], do you have further comments about it? Thanks. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942004#comment-14942004 ] Yi Liu commented on HDFS-9053: -- Thanks [~szetszwo], good comment, I ever considered it carefully too. I want to convince you to allow me only use B-Tree here: # Use the case you said, the #children is small and < 4K. *1)* If children is < 2K, then B-Tree only contains a root. As we counted before, the increased overhead is only 44 bytes which is really very small for a directory, a continuous block is 80 bytes memory (detail below), so we only increase about 1/2 continuous block for a directory in NN. *2)* If the children is > 2K and < 4K, here we use 4K as example, the B-Tree at most contains 3 branches: 1 root node, 3 leaf nodes. One leaf node increase about (40 bytes + 16 bytes elements array overhead) = 56 bytes, and 1 root node is (40 bytes + 16 bytes elements array overhead + 16 bytes children overhead + 3 children * 8) = 96 bytes, the b-tree itself is 40 bytes, and we need to subtract the ArrayList (40 bytes + 16 bytes elements array overhead) = 56 bytes, so we at most increase 56 * 3 + 96 + 40 - 56 = 248 bytes overhead, but ArrayList of 4K references to INode needs more than 4K * 8 = 32K memory, then we can get that the increased memory is only *0.75%* {noformat} org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguous object internals: OFFSET SIZE TYPE DESCRIPTIONVALUE 016 (object header)N/A 16 8 long Block.blockId N/A 24 8 long Block.numBytes N/A 32 8 long Block.generationStamp N/A 40 8 long BlockInfo.bcId N/A 48 2 short BlockInfo.replication N/A 50 6 (alignment/padding gap)N/A 56 8 LinkedElement BlockInfo.nextLinkedElementN/A 64 8 Object[] BlockInfo.triplets N/A 72 8 BlockUnderConstructionFeature BlockInfo.uc N/A Instance size: 80 bytes (estimated, the sample instance is not available) {noformat} # One advantage of B-Tree compared to ArrayList for small size children is: B-Tree can shrink. If the children of directory decreases from 4K to 2K, there are 2K * 8 = 16K memory wasteful if suing ArrayList. # On the other hand, if we do the switch between ArrayList and B-Tree, we may need write a class to wrap the two data structures, then it still needs 16bytes object overhead + 8 bytes additional reference = 24 bytes. How do you say? Thanks, Nicholas. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. An
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940673#comment-14940673 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- For the common case that the #children is small, say < 4K, ArrayList is better than B-Tree since it uses less memory and has similar running time as B-Tree. B-Tree is better for the special case that the #children is large. How about keep using ArrayList when #children < 4K and change to B-Tree when #children >= 4K. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939361#comment-14939361 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 49s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 8 new or modified test files. | | {color:green}+1{color} | javac | 8m 5s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 16s | There were no new javadoc warning messages. | | {color:red}-1{color} | release audit | 0m 16s | The applied patch generated 1 release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 5s | There were no new checkstyle issues. | | {color:green}+1{color} | whitespace | 0m 10s | The patch has no lines that end in whitespace. | | {color:green}+1{color} | install | 1m 38s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 23s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | common tests | 7m 46s | Tests failed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 200m 8s | Tests failed in hadoop-hdfs. | | | | 255m 18s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.metrics2.impl.TestGangliaMetrics | | | hadoop.hdfs.TestRecoverStripedFile | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12764524/HDFS-9053.004.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 5db371f | | Release Audit | https://builds.apache.org/job/PreCommit-HDFS-Build/12758/artifact/patchprocess/patchReleaseAuditProblems.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12758/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12758/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/12758/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf903.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/12758/console | This message was automatically generated. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https:/
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14939144#comment-14939144 ] Yi Liu commented on HDFS-9053: -- {quote} the new class Node is. There is a "this" reference to the outer class. {quote} Good point, Nicholas, I have missed that. I just checked it in the memory layout after you pointed out. {noformat} org.apache.hadoop.util.BTree.Node object internals: OFFSET SIZE TYPE DESCRIPTIONVALUE 016 (object header)N/A 16 4 int Node.elementsSize N/A 20 4 int Node.childrenSize N/A 24 8 Object[] Node.elements N/A 32 8 Object[] Node.children N/A 40 8BTree Node.this$0N/A Instance size: 48 bytes (estimated, the sample instance is not available) {noformat} Let me change the B-Tree Node class to static, then we can save the reference to outer class. Will update the patch later. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936627#comment-14936627 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- {quote} >It seems that there is at least one more reference not yet counted since > Node is non-static. I mean the increased memory compared with using ArrayList, and I have stated this in above comments. The reference to Node is already counted, use 64-bits system/JVM as example, the first '8' is the reference to the B-Tree root. The second '8' is the null reference for children in BTree#Node. This doesn't count the memory parts which ArrayList also has: 12 bytes object overhead + 4 bytes int modCount + 8 bytes reference to elements + 4 bytes int elementsSize + elements array memory. {quote} ArrayList is not an inner class but the new class Node is. There is a "this" reference to the outer class. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936229#comment-14936229 ] Yi Liu commented on HDFS-9053: -- Have fixed following errors we found so far: 1. the time complexity of insert/delete for ArrayList 2. In the JIRA description, change reference size to 8 bytes, since in real case, we use 64-bits system/JVM 3. In the JIRA comments, change reference size to 8 bytes and re-calculate estimation of the memory usage, and description about what system/JVM we assume. [~szetszwo], let's fix the errors if you find more. Thanks. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936183#comment-14936183 ] Yi Liu commented on HDFS-9053: -- Thanks [~szetszwo] for the comments. {quote} here are quite a few errors in description and the analysis. Let's hold on the patch for the moment and fix all these errors first. {quote} Sure, let's fix the errors in the description and the analysis first, thanks for pointing out this. {quote} > So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 > = 40 bytes on 64-bits system/JVM. (I have not counted object alignment) It seems that there is at least one more reference not yet counted since Node is non-static. {quote} The reference to Node is already counted, use 64-bits system/JVM as example, the first '8' is the reference to the B-Tree root. The second '8' is the null reference for children in BTree#Node, of course we are talking about the directory which is not large and only contains one B-Tree node. Nicholas, I am going to fix the two places in the JIRA description and analysis in JIRA comments, which we find so far # fix the time complexity of insert/delete for ArrayList # fix the description about memory usage for B-Tree to add the system/JVM information and add more detailed information. Any other errors do you find besides these? please tell me and appreciate if you can also help to edit them directly. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14935480#comment-14935480 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- > So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 > = 40 bytes on 64-bits system/JVM. (I have not counted object alignment) It seems that there is at least one more reference not yet counted since Node is non-static. There are quite a few errors in description and the analysis. Let's hold on the patch for the moment and fix all these errors first. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14935212#comment-14935212 ] Yi Liu commented on HDFS-9053: -- The three test failures are not related. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14935101#comment-14935101 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 20m 1s | Pre-patch trunk has 2 extant Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 8 new or modified test files. | | {color:green}+1{color} | javac | 7m 56s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 20s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 6s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 14s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 37s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 35s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 17s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | common tests | 7m 42s | Tests failed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 165m 33s | Tests failed in hadoop-hdfs. | | | | 220m 50s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.fs.shell.TestCopyPreserveFlag | | | hadoop.hdfs.web.TestWebHDFSOAuth2 | | | hadoop.hdfs.TestCrcCorruption | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12764210/HDFS-9053.003.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / d6fa34e | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/12739/artifact/patchprocess/trunkFindbugsWarningshadoop-common.html | | Pre-patch Findbugs warnings | https://builds.apache.org/job/PreCommit-HDFS-Build/12739/artifact/patchprocess/trunkFindbugsWarningshadoop-hdfs.html | | whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/12739/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12739/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12739/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/12739/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/12739/console | This message was automatically generated. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the e
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934862#comment-14934862 ] Yi Liu commented on HDFS-9053: -- Update the patch to address Jing's comments: 1. Call {{addOrReplace}} directly in {{INodeDirectory#replaceChild}} 2. add EK type to the ReadOnlyCollection/ReadOnlyList level. 3. Fix the bug in {{DirectoryWithSnapshotFeature#getChildrenList#iterator(K)#next()}}, and add a new test in {{TestLargeDirectory}} for it: list the snapshot of directory, since the snapshot of directory is large enough, so can hit the code path. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934689#comment-14934689 ] Yi Liu commented on HDFS-9053: -- [~jingzhao], you have given a good review, and it hits the two places I ever considered carefully how to do it better. Thanks. {quote} In INodeDirectory#replaceChild, can we directly call addOrReplace instead of calling get first? {quote} I think we can use {{addOrReplace}} directly. Since there should be an INode with the name existing. To keep original behavior, I will remove the added one if {{addOrReplace}} returns null. {quote} Do you think we can avoid the following code? Maybe we can add the EK type to the ReadOnlyCollection/ReadOnlyList level? {quote} That's a good comment, I ever considered this carefully. I also thought adding the EK type as one of generic type to the ReadOnlyCollection/ReadOnlyList level, but I felt it looked not natural for a collection/list, and not all implementations of ReadOnlyList need to implement iterating starting from specified element, also I though it was OK since it's a private interface we use in HDFS. I will leave this comment in next version of patch, if you feel we'd better to do this, I will update it, I am OK with the both ways. {quote} DirectoryWithSnapshotFeature#getChildrenList#iterator(EK) forgot to increase pos? Maybe also add a new test for this (e.g., set a small ls limit and list a snapshot of a directory)? {quote} Great catch, let me update it, and add a new test in {{TestLargeDirectory}} to cover. {quote} In getListing, instead of continuing the iteration, can we just call size() to calculate the number of the remaining items? {quote} I ever tried to find a better way. {{size()}} will return the total number of elements in B-Tree, but we don't know the current index, so seems not able to calculate the number of the remaining items. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934634#comment-14934634 ] Yi Liu commented on HDFS-9053: -- Thanks for the comments, Nicholas. {quote} Where are the numbers, especially the 4s, from? Do we assume a 32-bit world? {quote} {code} public class BTree> implements Iterable { ... private final int degree; private Node root; private int size; private transient int modCount = 0; ... } private final class Node { static final int DEFAULT_CAPACITY = 5; private Object[] elements; private int elementsSize; private Object[] children; private int childrenSize; ... } {code} Sorry, I should use 64-bits system/JVM, and details are: Compared to ArrayList, we increases following things: private final int degree; <- 4 bytes Integer private Node root; <- reference, 4 bytes on 32-bits system/JVM, 8 bytes on 64-bits system/JVM private int size;<- 4 bytes Integer {{Node}} object overhead <-- 12 bytes private Object[] children; <- null reference, 4 bytes on 32-bits system/JVM, 8 bytes on 64-bits system/JVM private int childrenSize; <- 4 bytes Integer. So totally 12+4+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 = 40 bytes on 64-bits system/JVM. (I have not counted object alignment) > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934611#comment-14934611 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- > Yeah, if we count the re-allocations and copies of big arrays, I just mean > the search time complexity before we do insert/delete. If we count only the search time complexity before we do insert/delete, it should be called "search time complexity" but not insert/delete time complexity. > ... total increase about (12+4+4+4+4+4) = 32 bytes. Where are the numbers, especially the 4s, from? Do we assume a 32-bit world? > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934598#comment-14934598 ] Yi Liu commented on HDFS-9053: -- Thanks Jing for the further review. {quote} Yeah, sorry for the delay. {quote} NP, thanks for the help :) I also think it's a critical part of the code, thanks a lot for the review, [~szetszwo]! also welcome [~kihwal] and anyone who can review this. {quote} For insert/delete, the running time is O( n) for ArrayList. {quote} Yeah, if we count the re-allocations and copies of big arrays, I just mean the search time complexity before we do insert/delete. {quote} How about memory usage? One reason to use ArrayList instead of a more complicated data structure is to save memory. It seems to me that using B-Tree increases memory usage in general. It increases memory usage dramatically in some worst cases such as the average branching factor of the tree is small (this may also be a common case). {quote} That's a great question here. I have given few description in the my second comment above: [here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814], I would give a detailed explanation: Generally say, the B-Tree in the patch increases *very few* memory usage, and can be ignored: # *For normal and small size directories*, let's say it's (INode) children size is not large and then B-Tree only contains one node. Currently the default degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047, that means one B-Tree node can at most contain 2047 elements. And the children of the B-Tree root node is null, so the increased memory usage compared to ArrayList is: one object overhead + few variables, total increase about (12+4+4+4+4+4) = 32 bytes. # *For large directories*, let's say it's (INode) children size is larger than the max degree and then B-Tree contains more than one nodes. For any B-Tree node except root, it will contains at least min degree size elements, in our case, it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so it's the same as ArrayList at this point, typically the worst case for B-Tree from memory point view is all elements are in-order, in this case, when we split a B-Tree node, we allocate the elements size of memory and do expanding, but no more elements are added later in the split left B-Tree node, and there is a waste of 1/3 size of elements array memory, but actually it's almost the same as the worst cases even for ArrayList if there is no/few elements added, but in practice, the elements are not in order. Now back to the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: one object overhead + one element pointer in the father + 2 variables, so total increase about (12 + 4+4+4) = 24 bytes for every new added B-Tree node, but one B-Tree node contains at least (degree - 1) elements, and at most (2* degree - 1) elements. Another small increased memory is the object overhead of the children in the B-Tree inner node. In conclusion, the increased memory used for B-Tree compared to ArrayList is very small, for small/normal directories, it's only 32 bytes overhead which is even less than memory of one block in NN, for large directories, besides additional increased 12 bytes memory for the variables in B-Tree , it will increase about 24 bytes memory for every new added B-Tree node which can contain 1023 ~ 2047 INodes in the directory in our case. We can ignore these few memory overhead for a directory. And last, the benefits of B-Tree is great and obvious as described in the JIRA description, also in the patch, I consider the memory overhead carefully while implementing the B-Tree. Please let me know if you have further questions. Thanks a lot. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934459#comment-14934459 ] Tsz Wo Nicholas Sze commented on HDFS-9053: --- The concept of using B-Tree seems good. > ... Currently we use an ArrayList for the children under a directory, and > the children are ordered in the list, for insert/delete/search, the time > complexity is O(log n) ... For insert/delete, the running time is O( n) for ArrayList. How about memory usage? One reason to use ArrayList instead of a more complicated data structure is to save memory. It seems to me that using B-Tree increases memory usage in general. It increases memory usage dramatically in some worst cases such as the average branching factor of the tree is small (this may also be a common case). > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934414#comment-14934414 ] Jing Zhao commented on HDFS-9053: - Also, since this is a critical part of the code, maybe we should invite more to review. Do you also want to take a look at the patch, [~szetszwo] and [~kihwal]? > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934412#comment-14934412 ] Jing Zhao commented on HDFS-9053: - Some further comments: # In {{INodeDirectory#replaceChild}}, can we directly call {{addOrReplace}} instead of calling {{get}} first? {code} final INode existing = children.get(newChild.getLocalNameBytes()); .. oldChild = existing; .. children.addOrReplace(newChild); {code} # Do you think we can avoid the following code? Maybe we can add the EK type to the ReadOnlyCollection/ReadOnlyList level? {code} public Iterator iterator(EK k) { final byte[] name = (byte[])k; {code} # {{DirectoryWithSnapshotFeature#getChildrenList#iterator(EK)}} forgot to increase {{pos}}? Maybe also add a new test for this (e.g., set a small ls limit and list a snapshot of a directory)? {code} public INode next() { if (pos >= childrenSize) { throw new NoSuchElementException(); } return children.get(pos); } {code} # In {{getListing}}, instead of continuing the iteration, can we just call {{size()}} to calculate the number of the remaining items? {code} while (i.hasNext()) { INode cur = i.next(); if (!(locationBudget > 0 && listingCnt < fsd.getLsLimit())) { remaining++; continue; } {code} > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933696#comment-14933696 ] Jing Zhao commented on HDFS-9053: - Yeah, sorry for the delay. I'm currently reviewing the remaining part. Will post comments later. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14910048#comment-14910048 ] Yi Liu commented on HDFS-9053: -- Hi [~jingzhao], could you help to do further reviews? Thanks a lot. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14904106#comment-14904106 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 45s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 8m 3s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 13s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 2s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 8s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 39s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 23m 56s | Tests passed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 195m 15s | Tests failed in hadoop-hdfs. | | | | 266m 28s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.server.blockmanagement.TestBlockManager | | | hadoop.hdfs.TestRollingUpgrade | | | hadoop.hdfs.server.blockmanagement.TestUnderReplicatedBlocks | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761615/HDFS-9053.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / cc2b473 | | whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/12615/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12615/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12615/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/12615/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf909.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/12615/console | This message was automatically generated. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more:
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903725#comment-14903725 ] Yi Liu commented on HDFS-9053: -- The Jenkins has some issue and the test failures are not related. The reason is: having multiple maven invocations going on at once sharing the same .m2 directory on the same machine, this patch includes new class file in Hadoop-common, when some other maven invocations run after this one on the same machine, "mvn install" of test will replace the hadoop-common jar in .m2 directory, so the test failures in Jenkins show NoClassDefFoundError for the new added class in hadoop-common. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902728#comment-14902728 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 50s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:green}+1{color} | javac | 7m 56s | There were no new javac warning messages. | | {color:green}+1{color} | javadoc | 10m 7s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 25s | The applied patch does not increase the total number of release audit warnings. | | {color:green}+1{color} | checkstyle | 2m 4s | There were no new checkstyle issues. | | {color:red}-1{color} | whitespace | 0m 8s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 37s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 34s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 24s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 23m 9s | Tests passed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 159m 26s | Tests failed in hadoop-hdfs. | | | | 229m 45s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.cli.TestXAttrCLI | | | hadoop.fs.viewfs.TestViewFsWithXAttrs | | | hadoop.fs.TestSWebHdfsFileContextMainOperations | | | hadoop.cli.TestAclCLI | | | hadoop.fs.TestHDFSFileContextMainOperations | | | hadoop.tracing.TestTracing | | | hadoop.security.TestRefreshUserMappings | | | hadoop.security.TestPermissionSymlinks | | | hadoop.fs.TestSymlinkHdfsFileContext | | | hadoop.fs.TestUrlStreamHandler | | | hadoop.fs.viewfs.TestViewFileSystemAtHdfsRoot | | | hadoop.fs.viewfs.TestViewFileSystemWithXAttrs | | | hadoop.fs.TestSymlinkHdfsFileSystem | | | hadoop.tools.TestJMXGet | | | hadoop.fs.viewfs.TestViewFsDefaultValue | | | hadoop.cli.TestHDFSCLI | | | hadoop.security.TestPermission | | | hadoop.fs.viewfs.TestViewFsHdfs | | | hadoop.cli.TestDeleteCLI | | | hadoop.fs.TestFcHdfsCreateMkdir | | | hadoop.fs.viewfs.TestViewFsWithAcls | | | hadoop.TestGenericRefresh | | | hadoop.fs.viewfs.TestViewFsAtHdfsRoot | | | hadoop.tracing.TestTracingShortCircuitLocalRead | | | hadoop.fs.TestFcHdfsPermission | | | hadoop.hdfs.web.TestWebHDFSOAuth2 | | | hadoop.tracing.TestTraceAdmin | | | hadoop.cli.TestCacheAdminCLI | | | hadoop.TestRefreshCallQueue | | | hadoop.fs.TestEnhancedByteBufferAccess | | | hadoop.cli.TestCryptoAdminCLI | | | hadoop.fs.viewfs.TestViewFileSystemHdfs | | | hadoop.fs.viewfs.TestViewFileSystemWithAcls | | | hadoop.fs.TestFcHdfsSetUMask | | | hadoop.fs.viewfs.TestViewFsFileStatusHdfs | | | hadoop.fs.shell.TestHdfsTextCommand | | Timed out tests | org.apache.hadoop.fs.contract.hdfs.TestHDFSContractDelete | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12761615/HDFS-9053.002.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 10ab7d5 | | whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/12589/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12589/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12589/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/12589/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf900.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/12589/console | This message was automatically generated. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the child
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14902232#comment-14902232 ] Yi Liu commented on HDFS-9053: -- Update the patch to address the comments, also cleanup the javac/checkstyle/whitespace. {quote} 4. Optional: in insertElement maybe we can copy elements only once if we need to expand the array. {quote} We still need to have two copy even if we handle the insertElement and expand together, and the steps are: 1) allocate new heap memory, 2) copy the elements before insertion point to the new memory, 3) set inserted element and copy the elements after insertion point to the new memory. It also makes logic a bit complex. The current insertElement is the same behavior as insertion of ArrayList. So maybe we can keep current behavior, how do you think? Thanks Jing. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901808#comment-14901808 ] Yi Liu commented on HDFS-9053: -- Thanks a lot for your review and spend lots of time on this, Jing! I will update the B-Tree part to address your comments later. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901761#comment-14901761 ] Jing Zhao commented on HDFS-9053: - Thanks for the great work, Yi! So far I just reviewed the B-Tree implementation part and it looks good to me. Just some minor comments: # "static" can be removed {code} public static interface Element extends Comparable { K getKey(); } {code} # The parameter is never used. {code} Node(boolean allocateMaxElements) { elements = new Object[maxElements()]; } {code} # It may be helpful to add some more Preconditions/assert check to verify the parameter and internal state. For example, some verification about the index i in the following code. {code} SplitResult split(int i) { E e = (E)elements[i]; Node next = new Node(true); {code} # Optional: in insertElement maybe we can copy elements only once if we need to expand the array. # Rename {{put}} to {{addOrReplace}} to make its semantic more clear? # Need to update the javadoc of {{removeElement}} and {{removeChild}}. # {{SplitResult#element}} and {{SplitResult#node}} can be declared as final. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14802729#comment-14802729 ] Yi Liu commented on HDFS-9053: -- The failed one test is not related to this patch. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791829#comment-14791829 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 18m 44s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:red}-1{color} | javac | 7m 35s | The applied patch generated 28 additional warning messages. | | {color:green}+1{color} | javadoc | 9m 50s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 45s | The applied patch generated 16 new checkstyle issues (total was 0, now 16). | | {color:red}-1{color} | whitespace | 0m 11s | The patch has 7 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 35s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 31s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 15s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 22m 29s | Tests passed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 160m 56s | Tests failed in hadoop-hdfs. | | | | 228m 34s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.web.TestWebHDFSOAuth2 | \\ \\ || Subsystem || Report/Notes || | Patch URL | http://issues.apache.org/jira/secure/attachment/12756270/HDFS-9053.001.patch | | Optional Tests | javadoc javac unit findbugs checkstyle | | git revision | trunk / 0832b38 | | javac | https://builds.apache.org/job/PreCommit-HDFS-Build/12501/artifact/patchprocess/diffJavacWarnings.txt | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/12501/artifact/patchprocess/diffcheckstylehadoop-common.txt | | whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/12501/artifact/patchprocess/whitespace.txt | | hadoop-common test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12501/artifact/patchprocess/testrun_hadoop-common.txt | | hadoop-hdfs test log | https://builds.apache.org/job/PreCommit-HDFS-Build/12501/artifact/patchprocess/testrun_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/12501/testReport/ | | Java | 1.7.0_55 | | uname | Linux asf906.gq1.ygridcore.net 3.13.0-36-lowlatency #63-Ubuntu SMP PREEMPT Wed Sep 3 21:56:12 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux | | Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/12501/console | This message was automatically generated. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791529#comment-14791529 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:red}-1{color} | pre-patch | 17m 54s | Findbugs (version ) appears to be broken on trunk. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:red}-1{color} | javac | 7m 59s | The applied patch generated 28 additional warning messages. | | {color:green}+1{color} | javadoc | 10m 9s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 23s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 30s | The applied patch generated 16 new checkstyle issues (total was 0, now 16). | | {color:red}-1{color} | whitespace | 0m 8s | The patch has 7 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 37s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 32s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 22s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:red}-1{color} | common tests | 22m 23s | Tests failed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 43m 47s | Tests failed in hadoop-hdfs. | | | | 111m 4s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.net.TestClusterTopology | | | hadoop.hdfs.TestFileStatus | | | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes | | | hadoop.hdfs.server.namenode.TestINodeFile | | | hadoop.fs.contract.hdfs.TestHDFSContractOpen | | | hadoop.hdfs.server.datanode.TestFsDatasetCache | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestDatanodeRestart | | | hadoop.hdfs.server.namenode.ha.TestHASafeMode | | | hadoop.hdfs.TestEncryptionZonesWithHA | | | hadoop.fs.contract.hdfs.TestHDFSContractMkdir | | | hadoop.hdfs.server.namenode.TestNameNodeXAttr | | | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby | | | hadoop.hdfs.shortcircuit.TestShortCircuitCache | | | hadoop.hdfs.TestDecommission | | | hadoop.hdfs.server.namenode.TestFSEditLogLoader | | | hadoop.hdfs.server.namenode.ha.TestDelegationTokensWithHA | | | hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes | | | hadoop.hdfs.server.blockmanagement.TestNameNodePrunesMissingStorages | | | hadoop.hdfs.server.datanode.TestCachingStrategy | | | hadoop.hdfs.server.namenode.TestFileJournalManager | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.cli.TestXAttrCLI | | | hadoop.hdfs.server.namenode.TestDeleteRace | | | hadoop.hdfs.server.namenode.TestParallelImageWrite | | | hadoop.hdfs.server.namenode.TestNameNodeRespectsBindHostKeys | | | hadoop.hdfs.server.namenode.TestNNStorageRetentionFunctional | | | hadoop.hdfs.server.namenode.TestSaveNamespace | | | hadoop.hdfs.TestDFSRename | | | hadoop.hdfs.util.TestDiff | | | hadoop.hdfs.server.datanode.TestDataNodeFSDataSetSink | | | hadoop.hdfs.server.namenode.snapshot.TestSnapshotManager | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyPersistReplicaPlacement | | | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailureToleration | | | hadoop.hdfs.server.datanode.TestDeleteBlockPool | | | hadoop.hdfs.TestRemoteBlockReader2 | | | hadoop.hdfs.server.namenode.TestStorageRestore | | | hadoop.hdfs.server.namenode.TestFileLimit | | | hadoop.hdfs.server.blockmanagement.TestNodeCount | | | hadoop.fs.contract.hdfs.TestHDFSContractSetTimes | | | hadoop.hdfs.server.namenode.snapshot.TestCheckpointsWithSnapshots | | | hadoop.hdfs.server.namenode.TestFSPermissionChecker | | | hadoop.hdfs.server.namenode.TestSecureNameNode | | | hadoop.hdfs.server.namenode.TestFileContextAcl | | | hadoop.hdfs.server.datanode.TestDataXceiverLazyPersistHint | | | hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA | | | hadoop.hdfs.TestDataTransferProtocol | | | hadoop.fs.viewfs.TestViewFsWithXAttrs | | | hadoop.hdfs.server.blockmanagement.TestDatanodeManager | | | hadoop.hdfs.server.datanode.TestDiskError | | | hadoop.hdfs.server.namenode.TestMalformedURLs | | | hadoop.hdfs.TestReadWhileWriting | | | hadoop.fs.TestSWebHdfsFileContextMainOperations | | | hadoop.hdfs.TestIsMethodSupported | | | hadoop.hdfs.TestParallelShortCircuitReadNoChecksum | | | hadoop.hdfs.server.blockmanagement.TestAvailableSpaceBlockPlacementPolicy | | | hadoop.hdfs.TestFileCr
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791373#comment-14791373 ] Yi Liu commented on HDFS-9053: -- Seems there is issue for Jenkins, re-trigger... > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790677#comment-14790677 ] Hadoop QA commented on HDFS-9053: - \\ \\ | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | pre-patch | 19m 37s | Pre-patch trunk compilation is healthy. | | {color:green}+1{color} | @author | 0m 0s | The patch does not contain any @author tags. | | {color:green}+1{color} | tests included | 0m 0s | The patch appears to include 7 new or modified test files. | | {color:red}-1{color} | javac | 7m 52s | The applied patch generated 28 additional warning messages. | | {color:green}+1{color} | javadoc | 10m 11s | There were no new javadoc warning messages. | | {color:green}+1{color} | release audit | 0m 24s | The applied patch does not increase the total number of release audit warnings. | | {color:red}-1{color} | checkstyle | 1m 52s | The applied patch generated 16 new checkstyle issues (total was 0, now 16). | | {color:red}-1{color} | whitespace | 0m 12s | The patch has 7 line(s) that end in whitespace. Use git apply --whitespace=fix. | | {color:green}+1{color} | install | 1m 39s | mvn install still works. | | {color:green}+1{color} | eclipse:eclipse | 0m 33s | The patch built with eclipse:eclipse. | | {color:green}+1{color} | findbugs | 4m 25s | The patch does not introduce any new Findbugs (version 3.0.0) warnings. | | {color:green}+1{color} | common tests | 23m 21s | Tests passed in hadoop-common. | | {color:red}-1{color} | hdfs tests | 32m 46s | Tests failed in hadoop-hdfs. | | | | 103m 12s | | \\ \\ || Reason || Tests || | Failed unit tests | hadoop.hdfs.server.balancer.TestBalancerWithEncryptedTransfer | | | hadoop.hdfs.TestHDFSFileSystemContract | | | hadoop.hdfs.qjournal.client.TestQuorumJournalManager | | | hadoop.hdfs.TestEncryptedTransfer | | | hadoop.hdfs.server.namenode.snapshot.TestSnapshotDiffReport | | | hadoop.hdfs.server.namenode.TestGetBlockLocations | | | hadoop.hdfs.TestBlockStoragePolicy | | | hadoop.hdfs.TestRollingUpgrade | | | hadoop.hdfs.web.TestWebHDFSForHA | | | hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes | | | hadoop.hdfs.server.namenode.metrics.TestNameNodeMetrics | | | hadoop.hdfs.server.namenode.ha.TestStandbyCheckpoints | | | hadoop.fs.viewfs.TestViewFileSystemWithAcls | | | hadoop.hdfs.server.datanode.TestDataNodeECN | | | hadoop.hdfs.web.TestWebHdfsUrl | | | hadoop.hdfs.TestClientProtocolForPipelineRecovery | | | hadoop.TestGenericRefresh | | | hadoop.hdfs.server.namenode.TestFSEditLogLoader | | | hadoop.hdfs.server.namenode.TestEditLogJournalFailures | | | hadoop.fs.contract.hdfs.TestHDFSContractSeek | | | hadoop.hdfs.TestPread | | | hadoop.hdfs.server.namenode.snapshot.TestXAttrWithSnapshot | | | hadoop.hdfs.tools.TestDFSZKFailoverController | | | hadoop.fs.viewfs.TestViewFsFileStatusHdfs | | | hadoop.hdfs.TestWriteConfigurationToDFS | | | hadoop.hdfs.TestParallelShortCircuitLegacyRead | | | hadoop.hdfs.TestFileCreationClient | | | hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes | | | hadoop.hdfs.server.namenode.ha.TestInitializeSharedEdits | | | hadoop.hdfs.crypto.TestHdfsCryptoStreams | | | hadoop.hdfs.server.datanode.TestTransferRbw | | | hadoop.hdfs.TestWriteBlockGetsBlockLengthHint | | | hadoop.hdfs.server.namenode.TestAddBlock | | | hadoop.fs.TestSymlinkHdfsDisable | | | hadoop.hdfs.TestDecommission | | | hadoop.hdfs.tools.offlineImageViewer.TestOfflineImageViewerForContentSummary | | | hadoop.hdfs.server.namenode.snapshot.TestSnapshotMetrics | | | hadoop.hdfs.server.datanode.TestDataNodeMultipleRegistrations | | | hadoop.hdfs.server.namenode.TestFSImageWithAcl | | | hadoop.hdfs.server.mover.TestMover | | | hadoop.hdfs.TestDFSRemove | | | hadoop.cli.TestCacheAdminCLI | | | hadoop.hdfs.server.namenode.ha.TestLossyRetryInvocationHandler | | | hadoop.fs.loadGenerator.TestLoadGenerator | | | hadoop.hdfs.server.datanode.TestNNHandlesBlockReportPerStorage | | | hadoop.hdfs.server.namenode.snapshot.TestSnapshotReplication | | | hadoop.hdfs.server.datanode.fsdataset.impl.TestLazyWriter | | | hadoop.hdfs.TestFileAppend4 | | | hadoop.hdfs.server.namenode.TestEditLogAutoroll | | | hadoop.hdfs.qjournal.server.TestJournalNode | | | hadoop.hdfs.server.namenode.ha.TestDNFencing | | | hadoop.hdfs.TestRollingUpgradeDowngrade | | | hadoop.hdfs.server.namenode.snapshot.TestSnapshotFileLength | | | hadoop.hdfs.server.datanode.TestNNHandlesCombinedBlockReport | | | hadoop.security.TestPermission | | | hadoop.hdfs.TestDFSShell | | | hadoop.hdfs.server.datanode.TestRefreshNamenodes | | | hadoop.hdfs.server.namenode.TestDeleteRace | | | hadoop.hdfs.server.namenode.ha.TestDNFencingWithReplication | | | hadoop.hdfs.server.datanode.TestDataNodeRol
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14769053#comment-14769053 ] Yi Liu commented on HDFS-9053: -- Upload the patch and trigger Jenkins. The patch looks large, but it's not complex as its size looks. Many modifications are caused by {{INodeDirectory#getChildren}} returns {{ReadOnlyCollection}} instead of {{ReadOnlyList}}. The patch mainly includes following parts: # BTree implementation and related tests. # INodeDirectory and related modification to use BTree and tests for large directory. # Other modifications by INodeDirectory update, especially {{INodeDirectory#getChildren}} returns {{ReadOnlyCollection}} instead of {{ReadOnlyList}}. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740842#comment-14740842 ] Yi Liu commented on HDFS-9053: -- To integrate B-Tree with INodeDirectory, it's natural besides directory exposes {{getChildrenList}} and returns {{ReadOnlyList}}, B-Tree implements Iterable, it's not a list. Actually most other places want to iterate all elements, and some places want to iterate starting from certain child. So my plan is add an new Iterable interface which extends {{java.lang.Iterable}} and allow creating an iterator starting from certain child. Then in directory, getChildren returns {{ReadOnlyIterator}}, and {{ReadOnlyList}} implements it the ReadOnlyIterator. For snapshot, it makes sense to keep current behavior, still use the {{ReadOnlyList}}. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740814#comment-14740814 ] Yi Liu commented on HDFS-9053: -- *1. Performance* I choose 1024 as the B-Tree degree. For ArrayList in tests, the elements are in order, so for insertion/deletion/getting, we get the index using binary search first and do insertion/deletion/getting, this is the same behavior as in INodeDirectory. >From the performance data below, B-Tree is much more better on >insertion/deletion, and almost same for get, for iteration, it's a bit worse >(since it's not to iterate continuous memory), but the iterations are fast >enough, so we can ignore. -- *Insert (random)* (in milliseconds) ||Data Size||B-Tree||ArrayList|| |1K|4|4| |64K|71|613| |512K|494|38022| |1M|1365|151855| |2M|3584|688158| -- *Delete (random)* (in milliseconds) ||Data Size||B-Tree||ArrayList|| |1K|4|4| |64K|75|643| |512K|658|40463| |1M|1427|161492| |2M|3724|718581| -- *Get (random)* (in milliseconds) ||Data Size||B-Tree||ArrayList|| |1K|1|1| |64K|16|17| |512K|279|295| |1M|759|755| |2M|2055|2047| -- *Iteration* (in milliseconds) ||Data Size||B-Tree||ArrayList|| |1K|n/a|n/a| |64K|9|3| |512K|30|16| |1M|61|23| |2M|143|40| *2. Memory* As stated in the description, B-Tree is very good because it solves #1, #3 issues from memory aspect. B-Tree may have few object overhead and additional array to store references to sub-trees. The overhead is relatively very small for large directories; for small directories, the overhead is small itself. Furthermore it's directory, so few overhead is acceptable, besides, I already try best effort to reduce the overhead while implementing. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree).patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14740370#comment-14740370 ] Yi Liu commented on HDFS-9053: -- Current patch only includes my implementation of B-Tree, I will post the patch integration with {{INodeDirectory}} later. Code logic of B-Tree is more complex than other normal data structure, the most challenged part is remove/merge, insert/split, iteration. I refer to google implementation of B-Tree in Go language (https://github.com/google/btree), and borrow their merge, split logic. So the correctness of that part should be already verified in practice. I checked the B-Tree implementation carefully according to B-Tree definition, also did many tests to make sure B-Tree implementation correct. Later, I will give some performance data of insertion/deletion/searching comparing between B-Tree and ArrayList. The implementation in patch is low memory footprint, if the elements size is not large (less than the maximum degree of B-Tree node), the B-Tree will only have one root node and contains an array for the elements, and the children array is null. I use java array for the elements and children, and control the expand in the code myself to save the object overhead and can use fewer memory compared to use ArrayList. > Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree).patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete/search, the time > complexity is O(log n), but insertion/deleting causes re-allocations and > copies of big arrays, so the operations are costly. For example, if the > children grow to 1M size, the ArrayList will resize to > 1M capacity, so need > > 1M * 4bytes = 4M continuous heap memory, it easily causes full GC in HDFS > cluster where namenode heap memory is already highly used. I recap the 3 > main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)