[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14942004#comment-14942004 ]
Yi Liu edited comment on HDFS-9053 at 10/3/15 12:49 AM: -------------------------------------------------------- Thanks [~szetszwo], good comment, I ever considered it carefully too. I want to convince you to allow me only use B-Tree here: # Use the case you said, the #children is small and < 4K. *1)* If children is < 2K, then B-Tree only contains a root. As we counted before, the increased overhead is only 44 bytes which is really very small for a directory, a continuous block is 80 bytes memory (detail below), so we only increase about 1/2 continuous block for a directory in NN. *2)* If the children is > 2K and < 4K, here we use 4K as example, the B-Tree at most contains 3 branches: 1 root node, 3 leaf nodes. One leaf node increase about (40 bytes + 16 bytes elements array overhead) = 56 bytes, and 1 root node is (40 bytes + 16 bytes elements array overhead + 16 bytes children overhead + 3 children * 8) = 96 bytes, the b-tree itself is 40 bytes, and we need to subtract the ArrayList (40 bytes + 16 bytes elements array overhead) = 56 bytes, so we at most increase 56 * 3 + 96 + 40 - 56 = 248 bytes overhead, but ArrayList of 4K references to INode needs more than 4K * 8 = 32K memory, then we can get that the increased memory is only *0.75%* {noformat} org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguous object internals: OFFSET SIZE TYPE DESCRIPTION VALUE 0 16 (object header) N/A 16 8 long Block.blockId N/A 24 8 long Block.numBytes N/A 32 8 long Block.generationStamp N/A 40 8 long BlockInfo.bcId N/A 48 2 short BlockInfo.replication N/A 50 6 (alignment/padding gap) N/A 56 8 LinkedElement BlockInfo.nextLinkedElement N/A 64 8 Object[] BlockInfo.triplets N/A 72 8 BlockUnderConstructionFeature BlockInfo.uc N/A Instance size: 80 bytes (estimated, the sample instance is not available) {noformat} # One advantage of B-Tree compared to ArrayList for small size children is: B-Tree can shrink. If the children of directory decreases from 4K to less than 2K, there are 2K * 8 = 16K memory wasteful if using ArrayList. # On the other hand, if we do the switch between ArrayList and B-Tree, we may need write a class to wrap the two data structures, then it still needs 16bytes object overhead + 8 bytes additional reference = 24 bytes. How do you say? Thanks, Nicholas. was (Author: hitliuyi): Thanks [~szetszwo], good comment, I ever considered it carefully too. I want to convince you to allow me only use B-Tree here: # Use the case you said, the #children is small and < 4K. *1)* If children is < 2K, then B-Tree only contains a root. As we counted before, the increased overhead is only 44 bytes which is really very small for a directory, a continuous block is 80 bytes memory (detail below), so we only increase about 1/2 continuous block for a directory in NN. *2)* If the children is > 2K and < 4K, here we use 4K as example, the B-Tree at most contains 3 branches: 1 root node, 3 leaf nodes. One leaf node increase about (40 bytes + 16 bytes elements array overhead) = 56 bytes, and 1 root node is (40 bytes + 16 bytes elements array overhead + 16 bytes children overhead + 3 children * 8) = 96 bytes, the b-tree itself is 40 bytes, and we need to subtract the ArrayList (40 bytes + 16 bytes elements array overhead) = 56 bytes, so we at most increase 56 * 3 + 96 + 40 - 56 = 248 bytes overhead, but ArrayList of 4K references to INode needs more than 4K * 8 = 32K memory, then we can get that the increased memory is only *0.75%* {noformat} org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguous object internals: OFFSET SIZE TYPE DESCRIPTION VALUE 0 16 (object header) N/A 16 8 long Block.blockId N/A 24 8 long Block.numBytes N/A 32 8 long Block.generationStamp N/A 40 8 long BlockInfo.bcId N/A 48 2 short BlockInfo.replication N/A 50 6 (alignment/padding gap) N/A 56 8 LinkedElement BlockInfo.nextLinkedElement N/A 64 8 Object[] BlockInfo.triplets N/A 72 8 BlockUnderConstructionFeature BlockInfo.uc N/A Instance size: 80 bytes (estimated, the sample instance is not available) {noformat} # One advantage of B-Tree compared to ArrayList for small size children is: B-Tree can shrink. If the children of directory decreases from 4K to less than 2K, there are 2K * 8 = 16K memory wasteful if suing ArrayList. # On the other hand, if we do the switch between ArrayList and B-Tree, we may need write a class to wrap the two data structures, then it still needs 16bytes object overhead + 8 bytes additional reference = 24 bytes. How do you say? Thanks, Nicholas. > Support large directories efficiently using B-Tree > -------------------------------------------------- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode > Reporter: Yi Liu > Assignee: Yi Liu > Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)