[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968445#comment-14968445 ]
Yi Liu edited comment on HDFS-9053 at 10/22/15 3:34 AM:
--------------------------------------------------------

Got it now. Thanks a lot [~jingzhao]! Let's recalculate. For the original B-Tree implementation, my estimate was an increase of 44 bytes, ignoring alignment. Now that I have looked into the real memory layout, it actually increases (40 + 40) - 40 (ArrayList) = 40 bytes. And I can make a small improvement by removing the {{degree}} variable: 4 bytes + 4 bytes of alignment/padding gap = 8 bytes. So finally, if we don't let BTree extend Node, it increases *32 bytes* per directory. A 32-byte memory increment per directory is fine for me, and it matches my original estimate. As in your example, if we have 1M directories, we only increase the heap size by 32 MB. I also respect Nicholas' comment; if we all think it's OK, I am happy to do this :).

{noformat}
org.apache.hadoop.util.BTree object internals:
 OFFSET  SIZE  TYPE  DESCRIPTION              VALUE
      0    16        (object header)          N/A
     16     4   int  BTree.degree             N/A
     20     4   int  BTree.size               N/A
     24     4   int  BTree.modCount           N/A
     28     4        (alignment/padding gap)  N/A
     32     8  Node  BTree.root               N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
{noformat}

{noformat}
org.apache.hadoop.util.BTree.Node object internals:
 OFFSET  SIZE      TYPE  DESCRIPTION          VALUE
      0    16            (object header)      N/A
     16     4       int  Node.elementsSize    N/A
     20     4       int  Node.childrenSize    N/A
     24     8  Object[]  Node.elements        N/A
     32     8  Object[]  Node.children        N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total
{noformat}
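The layout dumps above are in the format produced by JOL (Java Object Layout). As a minimal sketch, assuming the jol-core dependency and the BTree class from the attached patch are on the classpath (the class names here follow the dumps, not a released Hadoop API), such a dump could be reproduced like this:

{code:java}
import org.openjdk.jol.info.ClassLayout;

public class BTreeLayoutDump {
  public static void main(String[] args) throws Exception {
    // Class names assumed from the dumps above; load reflectively in case
    // the inner Node class is not public.
    Class<?> btree = Class.forName("org.apache.hadoop.util.BTree");
    Class<?> node = Class.forName("org.apache.hadoop.util.BTree$Node");
    // parseClass() estimates the layout without a live instance, which is why
    // the dumps say "estimated, the sample instance is not available".
    System.out.println(ClassLayout.parseClass(btree).toPrintable());
    System.out.println(ClassLayout.parseClass(node).toPrintable());
  }
}
{code}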
> Support large directories efficiently using B-Tree
> --------------------------------------------------
>
>                 Key: HDFS-9053
>                 URL: https://issues.apache.org/jira/browse/HDFS-9053
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>            Reporter: Yi Liu
>            Assignee: Yi Liu
>            Priority: Critical
>         Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, HDFS-9053.007.patch
>
>
> This is a long-standing issue that we have tried to improve in the past. Currently we use an ArrayList for the children under a directory, and the children are kept ordered in the list. For insert/delete the time complexity is O\(n) (the search is O(log n), but insertion/deletion causes re-allocations and copies of arrays), so for a large directory these operations are expensive. If the children grow to 1M entries, the ArrayList will resize to > 1M capacity, so it needs > 1M * 8 bytes = 8 MB of contiguous heap memory (the reference size is 8 bytes on a 64-bit system/JVM), which easily causes full GC in an HDFS cluster where the namenode heap is already highly used. To recap, the 3 main issues are:
> # Insertion/deletion operations in large directories are expensive because of re-allocations and copies of big arrays.
> # Dynamically allocating several MB of contiguous, long-lived heap memory can easily cause full GC problems.
> # Even if most children are removed later, the directory INode still occupies the same amount of heap memory, since the ArrayList never shrinks.
> This JIRA is similar to HDFS-7174 created by [~kihwal], but uses a B-Tree to solve the problem, as suggested by [~shv].
> So the target of this JIRA is to implement a low-memory-footprint B-Tree and use it to replace the ArrayList.
> If the number of elements is not large (less than the maximum degree of a B-Tree node), the B-Tree has only one root node which contains an array for the elements. If the size grows large enough, nodes split automatically, and if elements are removed, B-Tree nodes can merge automatically (see more: https://en.wikipedia.org/wiki/B-tree). This solves the above 3 issues.
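For illustration, a minimal hypothetical sketch of the sorted-ArrayList pattern the description refers to (not the actual INodeDirectory code); it shows why lookup is O(log n) while insert is O(n):

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedChildrenExample {
  public static void main(String[] args) {
    List<String> children = new ArrayList<>();
    insert(children, "b");
    insert(children, "a");
    insert(children, "c");
    System.out.println(children);  // [a, b, c]
  }

  // Lookup is O(log n) via binary search, but an insert into a 1M-entry list
  // still shifts up to 1M references, and the backing array
  // (> 1M * 8-byte references ~ 8 MB) must stay contiguous on the heap.
  static void insert(List<String> children, String name) {
    int i = Collections.binarySearch(children, name);
    if (i < 0) {
      children.add(-i - 1, name);  // O(n) array copy to make room
    }
  }
}
{code}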