[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968277#comment-14968277 ] Yi Liu edited comment on HDFS-9053 at 10/22/15 12:51 AM:

Thanks [~jingzhao]!
{quote}
My concern is if the number of directories is limited, then maybe increasing 44 bytes per directory is fine. E.g., for a big cluster with >100M files/directories, if we only have 1M directories, then we only increase the heap size by 44 MB. Compared with the total heap size this may be just <0.1% increase.
{quote}
The 44-byte increase per directory was the old implementation. In the latest patch, each directory only increases *8* bytes, so the per-directory overhead is quite small. The new approach also uses the B-Tree only; it does NOT switch between ArrayList and B-Tree, please look at the {{007}} patch. I know some users have more than 500M files, and some directories contain a very large number of files. Such large directories are usually also the most frequently accessed, so, as in the "barrel principle", they become the bottleneck of the NN.
> Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: Yi Liu >Assignee: Yi Liu >Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch, > HDFS-9053.004.patch, HDFS-9053.005.patch, HDFS-9053.006.patch, > HDFS-9053.007.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive. If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv]. > So the target of this JIRA is to implement a low memory footprint B-Tree and > use it to replace ArrayList. > If the elements size is not large (less than the maximum degree of B-Tree > node), the B-Tree only has one root node which contains an array for the > elements. 
And if the size grows large enough, it will split automatically, > and if elements are removed, then B-Tree nodes can merge automatically (see > more: https://en.wikipedia.org/wiki/B-tree). It will solve the above 3 > issues. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
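To make the cost model in the description concrete, here is a minimal, hypothetical Java sketch (my own illustration, not HDFS code) of how a sorted-ArrayList children list behaves: the lookup is a binary search, O(log n), but the insert itself shifts every later reference with an array copy, O(n), which is what makes large directories expensive.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedChildren {
    // Insert 'name' into a sorted children list: the search is O(log n),
    // but ArrayList.add(index, e) shifts all following references via
    // System.arraycopy, which is O(n) for a large directory.
    static void insertSorted(List<String> children, String name) {
        int i = Collections.binarySearch(children, name);
        if (i < 0) {
            i = -i - 1; // binarySearch returns -(insertionPoint) - 1 on a miss
        }
        children.add(i, name); // O(n) shift of the tail of the array
    }

    public static void main(String[] args) {
        List<String> children = new ArrayList<>(List.of("a", "c", "d"));
        insertSorted(children, "b");
        System.out.println(children);
    }
}
```

A B-Tree bounds the shifted region to one node (at most the maximum degree), which is why insert/delete stays cheap even at millions of children.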
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968445#comment-14968445 ] Yi Liu edited comment on HDFS-9053 at 10/22/15 3:34 AM:

Got it now. Thanks a lot [~jingzhao]! Let's recalculate. For the original B-Tree implementation, the 44-byte increase was my estimate and ignored alignment. I have now looked into the real memory layout: it actually increases (40 bytes BTree + 40 bytes Node) - 40 bytes ArrayList = 40 bytes. I can also make a small improvement and remove the {{degree}} variable: 4 bytes + 4 bytes alignment/padding gap = 8 bytes saved. So finally, if we don't let BTree extend Node, it increases *32 bytes* per directory. A 32-byte memory increment per directory is fine for me and matches my original estimate. As in your example, if we have 1M directories, we only increase the heap size by 32 MB. I also respect Nicholas' comment; if we all think it's OK, I am happy to do this :).
{noformat}
org.apache.hadoop.util.BTree object internals:
 OFFSET  SIZE  TYPE  DESCRIPTION              VALUE
      0    16        (object header)          N/A
     16     4  int   BTree.degree             N/A
     20     4  int   BTree.size               N/A
     24     4  int   BTree.modCount           N/A
     28     4        (alignment/padding gap)  N/A
     32     8  Node  BTree.root               N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 4 bytes internal + 0 bytes external = 4 bytes total
{noformat}
{noformat}
org.apache.hadoop.util.BTree.Node object internals:
 OFFSET  SIZE  TYPE      DESCRIPTION        VALUE
      0    16            (object header)    N/A
     16     4  int       Node.elementsSize  N/A
     20     4  int       Node.childrenSize  N/A
     24     8  Object[]  Node.elements      N/A
     32     8  Object[]  Node.children      N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
Space losses: 0 bytes internal + 0 bytes external = 0 bytes total
{noformat}
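The 40-to-32-byte arithmetic can be checked mechanically. The helper below is my own sketch (not Hadoop or JOL code) that models the 64-bit HotSpot rule of rounding an instance up to 8-byte alignment; with it, removing the 4-byte {{degree}} field also removes its 4-byte padding gap:

```java
public class LayoutMath {
    // Sum a 16-byte object header plus the given field sizes, then round
    // up to the 8-byte alignment HotSpot uses for instance sizes. This is
    // a simplification: it ignores field reordering, which happens to not
    // matter for these particular layouts.
    static int instanceSize(int headerBytes, int... fieldBytes) {
        int size = headerBytes;
        for (int f : fieldBytes) {
            size += f;
        }
        return (size + 7) / 8 * 8;
    }

    public static void main(String[] args) {
        // BTree with degree: header + degree + size + modCount + root ref
        int withDegree = instanceSize(16, 4, 4, 4, 8);
        // BTree without degree: the field and its padding both disappear
        int withoutDegree = instanceSize(16, 4, 4, 8);
        System.out.println(withDegree + " -> " + withoutDegree);
    }
}
```

Under this model the BTree object goes from 40 bytes to 32 bytes, matching the 8-byte saving (4-byte field + 4-byte alignment/padding gap) described above.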
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959955#comment-14959955 ] Yi Liu edited comment on HDFS-9053 at 10/16/15 12:56 AM:

Thanks [~szetszwo] for the comments.
{quote}
>> 24 8 Object[] Node.elements N/A
>> 32 8 Object[] Node.children N/A
It only counts the reference but array objects are not counted. So the BTree overhead is still a lot more than ArrayList.
{quote}
For a small elements size (assume # < max degree, which is 2047), {{children}} is a null reference, so there is no {{children}} array object here, just 8 bytes for the null reference. And as for the {{elements}} array, {{ArrayList}} has one too. So, as described above, {{BTree}} increases only 8 bytes compared with {{ArrayList}} for a small elements size. Do I miss something?
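The null-{{children}} point can be sketched as follows. This is a hypothetical single-node example of mine, not the patch: while the node is a leaf, {{children}} is never assigned, so a small directory pays only the 8-byte null reference and no children array object ever exists on the heap.

```java
import java.util.Arrays;

public class LazyNode {
    // Hypothetical root-only B-Tree node: 'children' stays null while the
    // node is a leaf, so a small directory pays only an 8-byte null
    // reference. A real B-Tree would allocate 'children' only when the
    // element count exceeds the maximum degree and the node splits.
    Object[] elements = new Object[4];
    Object[] children = null;
    int elementsSize;

    void add(Object e) {
        if (elementsSize == elements.length) {
            // Still below the max degree: just grow the elements array,
            // exactly as an ArrayList would. No split, no children array.
            elements = Arrays.copyOf(elements, elements.length * 2);
        }
        elements[elementsSize++] = e;
    }
}
```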
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14949847#comment-14949847 ] Yi Liu edited comment on HDFS-9053 at 10/9/15 8:49 AM:

Hi [~szetszwo], thanks for your comments. I will follow your suggestion: use an array list if the children size is small (<= 4K), otherwise use a B-Tree. But there may be some differences in the implementation details.
{quote}
The field in INodeDirectory is List children which may refer to either an ArrayList or a BTreeList. We may replace the list at runtime
{quote}
It's hard to use a List here: with an ArrayList we search before insert/delete and access by index, but a B-Tree is ordered and we don't access it by index. So I think we should do it the following way. I actually made an initial patch for switching between an array list and a B-Tree a few days ago. The ArrayList logic is not complicated, so I implemented it in a new class that supports shrinking: the new class controls the array and its expansion, and also keeps a reference to a B-Tree; if the elements size becomes large, it switches to the B-Tree. The new class is an ordered data structure, unlike ArrayList where we need to search before operating. In INodeDirectory we just use the new class. Another reason I implemented a new data structure class instead of doing the switching in INodeDirectory is: we need a {{SWITCH_THRESHOLD}} for switching from the array list to the B-Tree, and a lower water mark for switching back. They should not be the same value, otherwise the switching becomes frequent around that point, so I don't want to expose too much of the internal switching logic in INodeDirectory. I will give the memory usage after posting a new patch.
{quote}
I am also worried about the potential bugs and the risk. If there is a bug in the B-Tree, it is possible to lose one or more subtrees and, as a result, lose a lot of data. ArrayList is already well-tested. Anyway, we need more tests for the B-Tree, especially some long-running random tests.
{quote}
Sure, I will add more tests for it; I have already added many tests, including some long-running ones. I agree that a bug-free data structure implementation is not easy; we should be careful and test the new implementation extensively :)
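The threshold-plus-low-water-mark idea can be sketched like this. The wrapper below is my own hypothetical illustration (the names {{SWITCH_THRESHOLD}} and the low water mark come from the comment above; {{TreeSet}} stands in for the real B-Tree): because the upward and downward bounds differ, a size oscillating around one value does not cause repeated conversions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.TreeSet;

public class SwitchingChildren {
    static final int SWITCH_THRESHOLD = 4096; // array list -> B-Tree
    static final int LOW_WATER_MARK = 2048;   // B-Tree -> array list; must be lower

    private ArrayList<String> list = new ArrayList<>();
    private TreeSet<String> tree; // stand-in for the real B-Tree

    boolean usingTree() {
        return tree != null;
    }

    void add(String name) {
        if (tree != null) {
            tree.add(name);
        } else {
            int i = Collections.binarySearch(list, name);
            if (i < 0) {
                list.add(-i - 1, name);
            }
            if (list.size() > SWITCH_THRESHOLD) { // grew past the threshold
                tree = new TreeSet<>(list);
                list = null;
            }
        }
    }

    void remove(String name) {
        if (tree != null) {
            tree.remove(name);
            if (tree.size() < LOW_WATER_MARK) { // only switch back well below
                list = new ArrayList<>(tree);
                tree = null;
            }
        } else {
            int i = Collections.binarySearch(list, name);
            if (i >= 0) {
                list.remove(i);
            }
        }
    }
}
```

With equal thresholds, a directory hovering at exactly 4K children would convert on nearly every add/remove; the gap between the two bounds is what prevents that thrashing.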
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14950498#comment-14950498 ] Yi Liu edited comment on HDFS-9053 at 10/9/15 2:51 PM:

I found a good approach to reduce the B-Tree memory overhead so that it only increases *8 bytes* of memory compared with using {{ArrayList}} for a small elements size. So we don't need to use ArrayList when #children is small (< 4K); we can always use the {{BTree}}. The main idea is to let {{BTree}} extend the B-Tree Node; then we don't need a separate root node, since the {{BTree}} itself is the root.
{noformat}
java.util.ArrayList object internals:
 OFFSET  SIZE  TYPE      DESCRIPTION              VALUE
      0    16            (object header)          N/A
     16     4  int       AbstractList.modCount    N/A
     20     4            (alignment/padding gap)  N/A
     24     4  int       ArrayList.size           N/A
     28     4            (alignment/padding gap)  N/A
     32     8  Object[]  ArrayList.elementData    N/A
Instance size: 40 bytes (estimated, the sample instance is not available)
{noformat}
{noformat}
org.apache.hadoop.util.btree.BTree object internals:
 OFFSET  SIZE  TYPE      DESCRIPTION        VALUE
      0    16            (object header)    N/A
     16     4  int       Node.elementsSize  N/A
     20     4  int       Node.childrenSize  N/A
     24     8  Object[]  Node.elements      N/A
     32     8  Object[]  Node.children      N/A
     40     4  int       BTree.size         N/A
     44     4  int       BTree.modCount     N/A
Instance size: 48 bytes (estimated, the sample instance is not available)
{noformat}
We can see {{BTree}} only increases *8 bytes* compared with {{ArrayList}} for an {{INodeDirectory}}. [~jingzhao], [~szetszwo], please look at the new patch {{006}}.
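The BTree-extends-Node trick can be sketched as follows. This is my own illustrative code, not the {{006}} patch: the tree object doubles as the root node, so the root-only case (which covers every small directory) needs no separate root object and no {{root}} reference field.

```java
import java.util.Arrays;

class Node<E extends Comparable<E>> {
    int elementsSize;
    int childrenSize;
    Object[] elements = new Object[8];
    Object[] children; // null while this node is a leaf
}

// The tree IS its own root node: one object header, no 'root' field.
public class MiniBTree<E extends Comparable<E>> extends Node<E> {
    int size;
    int modCount;

    // Root-only insert path, which is all a small directory ever uses;
    // a full B-Tree would split once elementsSize passes the max degree.
    @SuppressWarnings("unchecked")
    void insert(E e) {
        if (elementsSize == elements.length) {
            elements = Arrays.copyOf(elements, elements.length * 2);
        }
        int i = elementsSize - 1;
        while (i >= 0 && ((E) elements[i]).compareTo(e) > 0) {
            elements[i + 1] = elements[i]; // shift to keep sorted order
            i--;
        }
        elements[i + 1] = e;
        elementsSize++;
        size++;
        modCount++;
    }
}
```

Compared with holding a {{root}} field, inheriting the Node fields saves the separate 16-byte Node header plus the 8-byte root reference, which is where the 48-byte total in the layout above comes from.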
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14942004#comment-14942004 ] Yi Liu edited comment on HDFS-9053 at 10/3/15 12:49 AM:

Thanks [~szetszwo], good comment; I considered this carefully too. I want to convince you to let me use only the B-Tree here:
# Take the case you mentioned, where #children is small (< 4K). *1)* If #children is < 2K, the B-Tree contains only a root. As we counted before, the increased overhead is only 44 bytes, which is really very small for a directory; a contiguous block is 80 bytes of memory (details below), so we only increase about half a contiguous block per directory in the NN. *2)* If #children is > 2K and < 4K (use 4K as the example), the B-Tree contains at most 1 root node and 3 leaf nodes. One leaf node increases about (40 bytes + 16 bytes elements array overhead) = 56 bytes; the root node is (40 bytes + 16 bytes elements array overhead + 16 bytes children array overhead + 3 children * 8 bytes) = 96 bytes; the B-Tree itself is 40 bytes; and we subtract the ArrayList (40 bytes + 16 bytes elements array overhead) = 56 bytes. So we increase at most 56 * 3 + 96 + 40 - 56 = 248 bytes of overhead, while an ArrayList of 4K references to INodes needs more than 4K * 8 = 32K of memory, so the memory increase is only about *0.75%*.
{noformat}
org.apache.hadoop.hdfs.server.blockmanagement.BlockInfoContiguous object internals:
 OFFSET  SIZE  TYPE                           DESCRIPTION                  VALUE
      0    16                                 (object header)              N/A
     16     8  long                           Block.blockId                N/A
     24     8  long                           Block.numBytes               N/A
     32     8  long                           Block.generationStamp        N/A
     40     8  long                           BlockInfo.bcId               N/A
     48     2  short                          BlockInfo.replication        N/A
     50     6                                 (alignment/padding gap)      N/A
     56     8  LinkedElement                  BlockInfo.nextLinkedElement  N/A
     64     8  Object[]                       BlockInfo.triplets           N/A
     72     8  BlockUnderConstructionFeature  BlockInfo.uc                 N/A
Instance size: 80 bytes (estimated, the sample instance is not available)
{noformat}
# One advantage of the B-Tree over ArrayList even for small children sizes is that the B-Tree can shrink. If a directory's children decrease from 4K to fewer than 2K, 2K * 8 = 16K of memory is wasted with ArrayList.
# On the other hand, if we switch between ArrayList and B-Tree, we may need a class that wraps the two data structures, which still costs 16 bytes object overhead + 8 bytes additional reference = 24 bytes.
What do you say? Thanks, Nicholas.
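The worst-case arithmetic in *2)* can be replayed directly. Every byte count below is an estimate taken from the comment above, not a measured value:

```java
public class OverheadMath {
    // Worst-case extra bytes of a 3-leaf B-Tree versus the ArrayList it
    // replaces, using the per-object estimates from the discussion above.
    static int worstCaseOverhead() {
        int leaf = 40 + 16;              // node object + elements array header
        int root = 40 + 16 + 16 + 3 * 8; // node + elements hdr + children hdr + 3 refs
        int btree = 40;                  // the BTree object itself
        int arrayList = 40 + 16;         // the ArrayList being replaced
        return 3 * leaf + root + btree - arrayList;
    }

    public static void main(String[] args) {
        int overhead = worstCaseOverhead();
        // Relative to the >= 32K of child references a 4K-entry directory holds.
        double percent = 100.0 * overhead / (4096 * 8);
        System.out.println(overhead + " bytes, about " + percent + "%");
    }
}
```

248 bytes against 32K of references is roughly 0.76%, consistent with the ~0.75% figure quoted above.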
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936183#comment-14936183 ] Yi Liu edited comment on HDFS-9053 at 10/1/15 1:33 AM: --- Thanks [~szetszwo] for the comments. {quote} There are quite a few errors in description and the analysis. Let's hold on the patch for the moment and fix all these errors first. {quote} Sure, let's fix the errors in the description and the analysis first; thanks for pointing this out. {quote} > So totally 12+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 > = 40 bytes on 64-bits system/JVM. (I have not counted object alignment) It seems that there is at least one more reference not yet counted since Node is non-static. {quote} I mean the *increased memory* compared with using ArrayList, as I stated in the comments above. The reference to Node is already counted: taking a 64-bit system/JVM as the example, the first '8' is the reference to the B-Tree root, and the second '8' is the null children reference in BTree#Node. This does not count the memory that ArrayList also has: 16 bytes object overhead + 4 bytes int modCount + 8 bytes reference to elements + 4 bytes int elementsSize + the elements array itself. Nicholas, I am going to fix the two places we have found so far in the JIRA description and in the analysis in the comments: # fix the time complexity of insert/delete for ArrayList # fix the description of memory usage for the B-Tree to state the system/JVM and add more detail. Have you found any other errors besides these? Please tell me, and I would appreciate it if you can also help to edit them directly.
> Support large directories efficiently using B-Tree > -- > > Key: HDFS-9053 > URL: https://issues.apache.org/jira/browse/HDFS-9053 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode > Reporter: Yi Liu > Assignee: Yi Liu > Priority: Critical > Attachments: HDFS-9053 (BTree with simple benchmark).patch, HDFS-9053 > (BTree).patch, HDFS-9053.001.patch, HDFS-9053.002.patch, HDFS-9053.003.patch > > > This is a long standing issue, we were trying to improve this in the past. > Currently we use an ArrayList for the children under a directory, and the > children are ordered in the list, for insert/delete, the time complexity is > O\(n), (the search is O(log n), but insertion/deleting causes re-allocations > and copies of arrays), for large directory, the operations are expensive.
If > the children grow to 1M size, the ArrayList will resize to > 1M capacity, so > need > 1M * 8bytes = 8M (the reference size is 8 for 64-bits system/JVM) > continuous heap memory, it easily causes full GC in HDFS cluster where > namenode heap memory is already highly used. I recap the 3 main issues: > # Insertion/deletion operations in large directories are expensive because > re-allocations and copies of big arrays. > # Dynamically allocate several MB continuous heap memory which will be > long-lived can easily cause full GC problem. > # Even most children are removed later, but the directory INode still > occupies same size heap memory, since the ArrayList will never shrink. > This JIRA is similar to HDFS-7174 created by [~kihwal], but use B-Tree to > solve the problem suggested by [~shv].
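The O(n) insert/delete cost described in the issue summary above comes from keeping the sorted children in one flat array: the slot is found with a binary search, but making room requires shifting every later element. A minimal illustration with a plain `java.util.ArrayList` (an analogy only, not the actual INodeDirectory code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortedInsert {
    // Finding the slot is O(log n), but List.add(index, e) shifts every
    // later element (System.arraycopy under the hood), so the insert
    // itself is O(n) -- and a full resize reallocates the whole array.
    static void insertSorted(List<String> children, String name) {
        int idx = Collections.binarySearch(children, name);
        if (idx < 0) {
            children.add(-idx - 1, name);  // insertion point for a missing key
        }
    }

    public static void main(String[] args) {
        List<String> children = new ArrayList<>(List.of("a", "c", "d"));
        insertSorted(children, "b");
        System.out.println(children);  // [a, b, c, d]
    }
}
```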
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934634#comment-14934634 ] Yi Liu edited comment on HDFS-9053 at 10/1/15 1:32 AM: --- Thanks for the comments, Nicholas. {quote} Where are the numbers, especially the 4s, from? Do we assume a 32-bit world? {quote}
{code}
public class BTree implements Iterable {
  ...
  private final int degree;
  private Node root;
  private int size;
  private transient int modCount = 0;
  ...

  private final class Node {
    static final int DEFAULT_CAPACITY = 5;
    private Object[] elements;
    private int elementsSize;
    private Object[] children;
    private int childrenSize;
    ...
  }
}
{code}
Sorry, I should have described this for a 64-bit system/JVM. The details: compared to ArrayList, we add the following:
private final int degree; <- 4 bytes int
private Node root; <- reference, 4 bytes on a 32-bit system/JVM, 8 bytes on a 64-bit system/JVM
private int size; <- 4 bytes int
{{Node}} object overhead <- 12 bytes on a 32-bit system/JVM, 16 bytes on a 64-bit system/JVM
private Object[] children; <- null reference, 4 bytes on a 32-bit system/JVM, 8 bytes on a 64-bit system/JVM
private int childrenSize; <- 4 bytes int
So in total 12+4+4+4+4+4 = 32 bytes on a 32-bit system/JVM, and 16+4+8+4+8+4 = 44 bytes on a 64-bit system/JVM. (I have not counted object alignment.)
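The per-field sums above can be checked mechanically. This sketch (a hypothetical `FieldSum` class) just encodes the field list, with the 12/16-byte object headers and 4/8-byte references assumed in this thread:

```java
// Encodes the field-by-field accounting from the comment above.
// Hypothetical sketch; header/reference sizes are the usual HotSpot
// assumptions made in this thread, not measured values.
public class FieldSum {
    // Extra bytes of a one-node BTree over ArrayList.
    static int extra(int headerBytes, int refBytes) {
        int degree       = 4;            // private final int degree
        int rootRef      = refBytes;     // private Node root
        int size         = 4;            // private int size
        int nodeHeader   = headerBytes;  // Node object overhead
        int childrenRef  = refBytes;     // null Object[] children reference
        int childrenSize = 4;            // private int childrenSize
        return degree + rootRef + size + nodeHeader + childrenRef + childrenSize;
    }

    public static void main(String[] args) {
        System.out.println(extra(12, 4));  // 32 bytes on a 32-bit JVM
        System.out.println(extra(16, 8));  // 44 bytes on a 64-bit JVM
    }
}
```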
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934598#comment-14934598 ] Yi Liu edited comment on HDFS-9053 at 10/1/15 1:30 AM: --- Thanks Jing for the further review. {quote} Yeah, sorry for the delay. {quote} NP, thanks for the help :) I also think it's a critical part of the code; thanks a lot for the review, [~szetszwo]! [~kihwal] and anyone else who can review this are also welcome. {quote} For insert/delete, the running time is O(n) for ArrayList. {quote} Yeah, if we count the re-allocations and copies of big arrays; I just meant the search time complexity before we do the insert/delete. {quote} How about memory usage? One reason to use ArrayList instead of a more complicated data structure is to save memory. It seems to me that using B-Tree increases memory usage in general. It increases memory usage dramatically in some worst cases such as the average branching factor of the tree is small (this may also be a common case). {quote} That's a great question. I gave a brief description in my second comment above ([here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814]); here is a detailed explanation. {quote} in some worst cases such as the average branching factor of the tree is small {quote} Don't worry about this: any B-Tree node except the root contains at least (degree - 1) elements; otherwise nodes are merged or elements are shifted to maintain this property. Besides, we grow the elements array of a B-Tree node the same way ArrayList does. Generally speaking, the B-Tree in the patch adds *very little* memory overhead, which can be ignored:
# *For normal and small directories*, say the (INode) children size is not large, so the B-Tree contains only one node. Currently the default degree of the B-Tree used in INodeDirectory is 1024, so one B-Tree node can contain at most 2 * 1024 - 1 = 2047 elements. The children array of the root node is null, so the increased memory usage compared to ArrayList is one object overhead plus a few variables: about (16 bytes object overhead + 4 bytes int degree + 8 bytes reference to root node + 4 bytes int size + 8 bytes null children reference + 4 bytes int childrenSize) = 44 bytes, assuming a 64-bit system/JVM.
# *For large directories*, say the (INode) children size is larger than the maximum node size, so the B-Tree contains more than one node. Any B-Tree node except the root contains at least (degree - 1) elements, in our case 1023. Furthermore, we grow the elements array of a B-Tree node the same way ArrayList does, so at this point it is the same as ArrayList. The worst case for the B-Tree from a memory point of view is when all elements are inserted in order: when we split a B-Tree node, we allocate and expand the elements array, but no more elements are added to the left node after the split, wasting about 1/3 of the elements-array memory. That is almost the same as the worst case for ArrayList when few or no elements are added after an expansion, and in practice the elements are not inserted in order. Now back to the overhead: for every B-Tree node, the increased memory usage is two object overheads (the node and its children array), one element pointer in the parent, and 2 variables, so about (2 * 16 bytes object overhead + 8 bytes reference to the node in the parent + 4 bytes int elementsSize + 8 bytes reference to children + 4 bytes int childrenSize) = 56 bytes for every newly added B-Tree node, while one B-Tree node contains at least (degree - 1) and at most (2 * degree - 1) elements. Another small cost is the object overhead of the children array in a B-Tree inner node, one for every (degree - 1) to (2 * degree - 1) child B-Tree nodes. In conclusion, the memory increase of the B-Tree compared to ArrayList is very small. For small/normal directories it is only 44 bytes of overhead, which is less than the memory of one block in the NN. For large directories, besides an additional (4 bytes int degree + 8 bytes reference to root node + 4 bytes int size) = 16 bytes for the variables in BTree, it adds about 56 bytes for every new B-Tree node, which in our case holds 1023 ~ 2047 INodes of the directory. This overhead is negligible for a directory. Finally, the benefits of the B-Tree are significant and obvious, as described in the JIRA description, and in the patch I considered the memory overhead carefully while implementing the B-Tree. Please let me know if you have further questions. Thanks a lot.
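To put the 56-bytes-per-node estimate in perspective, a sketch (hypothetical `AmortizedOverhead` class, numbers taken from this comment) of the worst-case amortized overhead per child:

```java
// Amortized per-child overhead of the B-Tree node bookkeeping, using the
// estimates from the comment above (hypothetical sketch, 64-bit JVM).
public class AmortizedOverhead {
    static final int DEGREE = 1024;       // default degree in INodeDirectory
    static final int NODE_OVERHEAD = 56;  // estimated extra bytes per node

    // Worst case: every non-root node holds the minimum degree - 1 = 1023
    // elements, so the node overhead is spread over as few children as possible.
    static double worstCasePerChildBytes() {
        return (double) NODE_OVERHEAD / (DEGREE - 1);
    }

    public static void main(String[] args) {
        // ~0.055 bytes of node overhead per child INode, vs. the 8 bytes
        // the child reference itself costs -- well under 1%.
        System.out.println(worstCasePerChildBytes());
    }
}
```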
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14934689#comment-14934689 ] Yi Liu edited comment on HDFS-9053 at 9/29/15 6:18 AM: --- [~jingzhao], you have given a great review, thanks a lot; it hits the two places where I had thought carefully about how to do better. {quote} In INodeDirectory#replaceChild, can we directly call addOrReplace instead of calling get first? {quote} I think we can use {{addOrReplace}} directly, since an INode with that name should already exist. To keep the original behavior, I will remove the added one if {{addOrReplace}} returns null. {quote} Do you think we can avoid the following code? Maybe we can add the EK type to the ReadOnlyCollection/ReadOnlyList level? {quote} That's a good comment; I considered this carefully. I also thought about adding the EK type as a generic type parameter at the ReadOnlyCollection/ReadOnlyList level, but I felt it did not look natural for a collection/list, and not all implementations of ReadOnlyList need to support iterating from a specified element; I also thought it was OK since it's a private interface we use in HDFS. I will leave this comment for the next version of the patch; if you feel we should do this, I will update it, I am OK with both ways. {quote} DirectoryWithSnapshotFeature#getChildrenList#iterator(EK) forgot to increase pos? Maybe also add a new test for this (e.g., set a small ls limit and list a snapshot of a directory)? {quote} Great catch, let me fix it and add a new test in {{TestLargeDirectory}} to cover it. {quote} In getListing, instead of continuing the iteration, can we just call size() to calculate the number of the remaining items? {quote} I tried to find a better way: {{size()}} returns the total number of elements in the B-Tree, but we don't know the current index, so it seems we cannot calculate the number of remaining items from it.
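For context on the {{iterator(EK)}} discussion above: resuming iteration from a given key is the usual way to page through a sorted structure without tracking an index. {{java.util.TreeMap}} shows the same pattern (an analogy only, not the actual HDFS B-Tree API):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class ResumeListing {
    // Return up to 'limit' names strictly after 'startAfter' -- the same
    // shape as a paged directory-listing call that resumes from a key
    // rather than from a positional index.
    static String[] page(NavigableMap<String, String> children,
                         String startAfter, int limit) {
        return children.tailMap(startAfter, /* inclusive = */ false)
                       .keySet().stream().limit(limit)
                       .toArray(String[]::new);
    }

    public static void main(String[] args) {
        NavigableMap<String, String> dir = new TreeMap<>();
        for (String n : new String[] {"a", "b", "c", "d", "e"}) dir.put(n, n);
        String[] batch = page(dir, "b", 2);          // resume after "b"
        System.out.println(String.join(",", batch)); // c,d
    }
}
```

Note that this resume-by-key style also explains why {{size()}} alone cannot give the remaining count: the structure knows its total size but not the position of an arbitrary key.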
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934598#comment-14934598 ] Yi Liu edited comment on HDFS-9053 at 9/30/15 1:29 AM: --- Thanks Jing for the further review. {quote} Yeah, sorry for the delay. {quote} NP, thanks for the help :) I also think it's a critical part of the code, thanks a lot for the review, [~szetszwo]! also welcome [~kihwal] and anyone who can review this. {quote} For insert/delete, the running time is O( n) for ArrayList. {quote} Yeah, if we count the re-allocations and copies of big arrays, I just mean the search time complexity before we do insert/delete. {quote} How about memory usage? One reason to use ArrayList instead of a more complicated data structure is to save memory. It seems to me that using B-Tree increases memory usage in general. It increases memory usage dramatically in some worst cases such as the average branching factor of the tree is small (this may also be a common case). {quote} That's a great question here. I have given few description in the my second comment above: [here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814], I would give a detailed explanation: {quote} in some worst cases such as the average branching factor of the tree is small {quote} Don't worry about this, for any B-Tree node except root, it contains at least min degree size of elements, otherwise there is merging or element shift to guarantee this property. Besides, we expand the elements in B-Tree node the same as ArrayList. Generally say, the B-Tree in the patch increases *very few* memory usage, and can be ignored: # *For normal and small size directories*, let's say it's (INode) children size is not large and then B-Tree only contains one node. 
Currently the default degree of B-Tree used in INodeDirectory is 1024, so the max degree is 2047, that means one B-Tree node can at most contain 2047 elements. And the children of the B-Tree root node is null, so the increased memory usage compared to ArrayList is: one object overhead + few variables, total increase about (12 bytes object overhead + 4 int degree + 8 bytes reference to root node + 4 int size + 8 bytes null reference children + 4 bytes int childrenSize) = 40 bytes, and assume it's 64-bits system/JVM. # *For large directories*, let's say it's (INode) children size is larger than the max degree and then B-Tree contains more than one nodes. For any B-Tree node except root, it contains at least min degree size elements, in our case, it's 1023, furthermore, we expand elements of B-Tree node same as ArrayList, so it's the same as ArrayList at this point, typically the worst case for B-Tree from memory point view is all elements are in-order, in this case, when we split a B-Tree node, we allocate the elements size of memory and do expanding, but no more elements are added later in the split left B-Tree node, and there is a waste of 1/3 size of elements array memory, but actually it's almost the same as the worst cases even for ArrayList if there is no/few elements added after expanding, but in practice, the elements are not in order. Now back to the overhead, for B-Tree, for every B-Tree node, the increased memory usage is: one object overhead + one element pointer in the father + 2 variables, so total increase about (2* 12 two object overhead + 8 bytes reference to node in the father + 4 bytes int elementsSize + 8 bytes reference to children + 4 bytes in childrenSize) = 48 bytes for every new added B-Tree node, but one B-Tree node contains at least (degree - 1) elements, and at most (2* degree - 1) elements. 
Another small memory increase is the object overhead of the children array in a B-Tree inner node, once per every (degree - 1) to (2 * degree - 1) child B-Tree nodes. In conclusion, the increased memory of B-Tree compared to ArrayList is very small. For small/normal directories it is only a 40 bytes overhead, which is even less than the memory of one block in the NN. For large directories, besides an additional (4 bytes int degree + 8 bytes reference to root node + 4 bytes int size) = 16 bytes for the variables in BTree, it increases about 48 bytes of memory for every newly added B-Tree node, and each such node can contain 1023 ~ 2047 INodes of the directory in our case. We can ignore this small memory overhead for a directory. Lastly, the benefits of B-Tree are great and obvious, as described in the JIRA description, and in the patch I considered the memory overhead carefully while implementing the B-Tree. Please let me know if you have further questions. Thanks a lot.
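The node-capacity and per-node overhead figures above can be sketched as arithmetic. This is a hedged illustration only (the class and method names are mine, not from the HDFS-9053 patch), assuming a 64-bit JVM, 8-byte references, a 12-byte object header, and no object alignment, as in the comment:

```java
// A hedged sketch of the byte counts discussed in this comment:
// 64-bit JVM, 8-byte references, 12-byte object header, alignment ignored.
public class BTreeOverhead {
    static final int OBJ_HEADER = 12;
    static final int REF = 8;
    static final int INT = 4;

    // A B-Tree node with min degree d holds between d - 1 and 2 * d - 1
    // elements (except the root, which may hold fewer).
    static int minElements(int degree) { return degree - 1; }
    static int maxElements(int degree) { return 2 * degree - 1; }

    // Single-node case: BTree fields (int degree, root reference, int size)
    // plus Node overhead (header, null children reference, int childrenSize).
    static int singleNodeOverhead() {
        return INT + REF + INT + OBJ_HEADER + REF + INT; // 40 bytes
    }

    // Every further node: two object headers (the Node and its elements
    // array), the parent's reference to it, int elementsSize, the children
    // reference, and int childrenSize.
    static int perNodeOverhead() {
        return 2 * OBJ_HEADER + REF + INT + REF + INT; // 48 bytes
    }
}
```

With the default degree of 1024 this gives 1023 ~ 2047 elements per node, a 40-byte overhead for a single-node tree, and 48 bytes per additional node, matching the totals in the comment.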
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936183#comment-14936183 ] Yi Liu edited comment on HDFS-9053 at 9/30/15 3:48 AM: --- Thanks [~szetszwo] for the comments. {quote} There are quite a few errors in the description and the analysis. Let's hold the patch for the moment and fix all these errors first. {quote} Sure, let's fix the errors in the description and the analysis first; thanks for pointing this out. {quote} > So totally 12+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 = 40 bytes on 64-bits system/JVM. (I have not counted object alignment) It seems that there is at least one more reference not yet counted since Node is non-static. {quote} The reference to Node is already counted. Using a 64-bits system/JVM as an example, the first '8' is the reference to the B-Tree root, and the second '8' is the null reference for children in BTree#Node. Of course, we are talking about a directory which is not large and therefore contains only one B-Tree node. Nicholas, I am going to fix the two places in the JIRA description and the analysis in the JIRA comments that we have found so far: # fix the time complexity of insert/delete for ArrayList # fix the description of memory usage for B-Tree to add the system/JVM information and more details. Have you found any other errors besides these? Please tell me, and I would appreciate it if you could also help to edit them directly.
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936183#comment-14936183 ] Yi Liu edited comment on HDFS-9053 at 9/30/15 4:04 AM: --- Thanks [~szetszwo] for the comments. {quote} There are quite a few errors in the description and the analysis. Let's hold the patch for the moment and fix all these errors first. {quote} Sure, let's fix the errors in the description and the analysis first; thanks for pointing this out. {quote} > So totally 12+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 = 40 bytes on 64-bits system/JVM. (I have not counted object alignment) It seems that there is at least one more reference not yet counted since Node is non-static. {quote} I mean the *increased memory* compared with using ArrayList, as I stated in the comments above. The reference to Node is already counted. Using a 64-bits system/JVM as an example, the first '8' is the reference to the B-Tree root, and the second '8' is the null reference for children in BTree#Node. This does not count the memory parts that ArrayList also has: 12 bytes object overhead + 4 bytes int modCount + 8 bytes reference to elements + 4 bytes int elementsSize + the elements array memory. Nicholas, I am going to fix the two places in the JIRA description and the analysis in the JIRA comments that we have found so far: # fix the time complexity of insert/delete for ArrayList # fix the description of memory usage for B-Tree to add the system/JVM information and more details. Have you found any other errors besides these? Please tell me, and I would appreciate it if you could also help to edit them directly.
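To make the "not counted" baseline concrete, the shallow ArrayList parts the comment lists can be tallied the same way. A hedged sketch (names are mine; 64-bit JVM, 12-byte object header, 8-byte references, alignment ignored, per the thread's assumptions):

```java
// The shallow fields ArrayList also has, which both structures share and
// which are therefore excluded from the B-Tree overhead figures above.
public class ArrayListBaseline {
    static int sharedShallowBytes() {
        final int header = 12, intField = 4, ref = 8;
        // object header + int modCount + reference to elements array + int size;
        // the elements array itself is shared too and not counted here.
        return header + intField + ref + intField; // 28 bytes
    }
}
```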
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936183#comment-14936183 ] Yi Liu edited comment on HDFS-9053 at 9/30/15 3:55 AM: --- Thanks [~szetszwo] for the comments. {quote} There are quite a few errors in the description and the analysis. Let's hold the patch for the moment and fix all these errors first. {quote} Sure, let's fix the errors in the description and the analysis first; thanks for pointing this out. {quote} > So totally 12+4+4+4+4+4 = 32 bytes on 32-bits system/JVM, and 12+4+8+4+8+4 = 40 bytes on 64-bits system/JVM. (I have not counted object alignment) It seems that there is at least one more reference not yet counted since Node is non-static. {quote} I mean the *increased memory* compared with using ArrayList, as I stated in the comments above. The reference to Node is already counted. Using a 64-bits system/JVM as an example, the first '8' is the reference to the B-Tree root, and the second '8' is the null reference for children in BTree#Node. This does not count the memory parts that ArrayList also has: 12 bytes object overhead + 4 bytes int modCount + 8 bytes reference to elements + 4 bytes int elementsSize. Nicholas, I am going to fix the two places in the JIRA description and the analysis in the JIRA comments that we have found so far: # fix the time complexity of insert/delete for ArrayList # fix the description of memory usage for B-Tree to add the system/JVM information and more details. Have you found any other errors besides these? Please tell me, and I would appreciate it if you could also help to edit them directly.
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934634#comment-14934634 ] Yi Liu edited comment on HDFS-9053 at 9/29/15 5:15 AM: --- Thanks for the comments, Nicholas. {quote} Where are the numbers, especially the 4s, from? Do we assume a 32-bit world? {quote} {code}
public class BTree<E> implements Iterable<E> {
  ...
  private final int degree;
  private Node root;
  private int size;
  private transient int modCount = 0;
  ...

  private final class Node {
    static final int DEFAULT_CAPACITY = 5;
    private Object[] elements;
    private int elementsSize;
    private Object[] children;
    private int childrenSize;
    ...
  }
}
{code} Sorry, I should have used a 64-bits system/JVM in the description. The details are as follows. Compared to ArrayList, we add the following:
private final int degree; <- 4 bytes int
private Node root; <- reference, 4 bytes on 32-bits system/JVM, 8 bytes on 64-bits system/JVM
private int size; <- 4 bytes int
{{Node}} object overhead <- 12 bytes
private Object[] children; <- null reference, 4 bytes on 32-bits system/JVM, 8 bytes on 64-bits system/JVM
private int childrenSize; <- 4 bytes int
So in total 12+4+4+4+4+4 = 32 bytes on a 32-bits system/JVM, and 12+4+8+4+8+4 = 40 bytes on a 64-bits system/JVM. (I have not counted object alignment.)
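The 32- vs 40-byte totals in the breakdown above depend only on the reference size, which can be checked with a one-line calculation. This is an illustration, not patch code (a 12-byte object header is assumed and alignment ignored, as in the comment):

```java
// Single-node B-Tree overhead as a function of reference size:
// 4 bytes per reference on a 32-bit JVM, 8 bytes on a 64-bit JVM.
public class RefSizeTotals {
    static int increasedBytes(int refSize) {
        final int header = 12, intField = 4;
        return intField + refSize + intField   // BTree: int degree, root reference, int size
             + header + refSize + intField;    // Node: header, null children reference, int childrenSize
    }
}
```

Plugging in the two reference sizes reproduces the comment's totals: 32 bytes for 4-byte references and 40 bytes for 8-byte references.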
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934598#comment-14934598 ] Yi Liu edited comment on HDFS-9053 at 9/29/15 4:48 AM: --- Thanks Jing for the further review. {quote} Yeah, sorry for the delay. {quote} NP, thanks for the help :) I also think it's a critical part of the code, thanks a lot for the review, [~szetszwo]! Also welcome [~kihwal] and anyone else who can review this. {quote} For insert/delete, the running time is O( n) for ArrayList. {quote} Yeah, if we count the re-allocations and copies of big arrays; I just meant the search time complexity before we do the insert/delete. {quote} How about memory usage? One reason to use ArrayList instead of a more complicated data structure is to save memory. It seems to me that using B-Tree increases memory usage in general. It increases memory usage dramatically in some worst cases such as the average branching factor of the tree is small (this may also be a common case). {quote} That's a great question. I gave a brief description in my second comment above: [here|https://issues.apache.org/jira/browse/HDFS-9053?focusedCommentId=14740814=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14740814]. Here is a detailed explanation: {quote} in some worst cases such as the average branching factor of the tree is small {quote} Don't worry about this: any B-Tree node except the root contains at least the min degree size of elements, otherwise merging or element shifting guarantees this property. Besides, we expand the elements array in a B-Tree node the same way ArrayList does. Generally speaking, the B-Tree in the patch increases memory usage by *very little*, which can be ignored: # *For normal and small directories*, i.e., the (INode) children size is not large and the B-Tree contains only one node. Currently the default degree of the B-Tree used in INodeDirectory is 1024, so the max degree is 2047, which means one B-Tree node can contain at most 2047 elements. The children of the B-Tree root node is null, so the increased memory usage compared to ArrayList is one object overhead plus a few variables, in total about (12+4+4+4+4+4) = 32 bytes. # *For large directories*, i.e., the (INode) children size is larger than the max degree and the B-Tree contains more than one node. Any B-Tree node except the root contains at least the min degree size of elements, in our case 1023. Furthermore, we expand the elements array of a B-Tree node the same way ArrayList does, so at this point it is the same as ArrayList. Typically the worst case for B-Tree from a memory point of view is when all elements are inserted in order: when we split a B-Tree node, we allocate the elements array and do the expanding, but no more elements are added later to the left node of the split, so about 1/3 of the elements array memory is wasted. This is almost the same as the worst case for ArrayList when no/few elements are added after expanding; in practice, elements are not inserted in order. Now back to the overhead: for every B-Tree node, the increased memory usage is one object overhead + one element pointer in the parent + 2 variables, in total about (12+4+4+4) = 24 bytes for every newly added B-Tree node, and one B-Tree node contains at least (degree - 1) elements and at most (2 * degree - 1) elements. Another small memory increase is the object overhead of the children array in a B-Tree inner node. In conclusion, the increased memory of B-Tree compared to ArrayList is very small. For small/normal directories it is only a 32 bytes overhead, which is even less than the memory of one block in the NN. For large directories, besides an additional 12 bytes for the variables in BTree, it increases about 24 bytes of memory for every newly added B-Tree node, and each such node can contain 1023 ~ 2047 INodes of the directory in our case. We can ignore this small memory overhead for a directory. Lastly, the benefits of B-Tree are great and obvious, as described in the JIRA description, and in the patch I considered the memory overhead carefully while implementing the B-Tree. Please let me know if you have further questions. Thanks a lot.
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903725#comment-14903725 ] Yi Liu edited comment on HDFS-9053 at 9/23/15 12:40 AM: Jenkins has an issue, and the test failures are not related. The reason: multiple Maven invocations run at once sharing the same .m2 directory on the same machine. This patch adds a new class file in hadoop-common, and when another Maven invocation runs after this one on the same machine, its "mvn install" during testing replaces the hadoop-common jar in the .m2 directory, so the test failures in Jenkins show NoClassDefFoundError for the newly added class in hadoop-common. I will re-trigger Jenkins some other time when there are few Jenkins jobs in the queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (HDFS-9053) Support large directories efficiently using B-Tree
[ https://issues.apache.org/jira/browse/HDFS-9053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14740842#comment-14740842 ] Yi Liu edited comment on HDFS-9053 at 9/15/15 12:35 AM: To integrate B-Tree with INodeDirectory: currently the directory exposes {{getChildrenList}} and returns {{ReadOnlyList}}, but B-Tree will be updated to implement Collection (now it implements Iterable); it is not a list. Most other places want to iterate over all elements, and some places want to iterate starting from a certain child. So my plan is to add {{ReadOnlyCollection}}, have {{ReadOnlyList}} extend it, and have getChildren return {{ReadOnlyCollection}}. {{ReadOnlyCollection}} has a method that allows creating an Iterator starting from a certain child. For snapshots, it makes sense to keep the current behavior and still use a list to keep the CREATED/DELETED diff. was (Author: hitliuyi): To integrate B-Tree with INodeDirectory: currently the directory exposes {{getChildrenList}} and returns {{ReadOnlyList}}, but B-Tree implements Iterable; it is not a list. Most other places want to iterate over all elements, and some places want to iterate starting from a certain child. So my plan is to add a new Iterable interface which extends {{java.lang.Iterable}} and allows creating an iterator starting from a certain child. Then in the directory, getChildren returns {{ReadOnlyIterator}}, and {{ReadOnlyList}} implements the ReadOnlyIterator. For snapshots, it makes sense to keep the current behavior and still use {{ReadOnlyList}}.
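The integration plan above (a {{ReadOnlyCollection}} whose iterator can start from a certain child) might look roughly like the following. Only the names {{ReadOnlyCollection}} and getChildren come from the comment; the method signature and the sorted-list backing are illustrative assumptions, not the patch's actual API:

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Read-only view with an extra factory for an iterator that starts at
// (or after) a given child, as the integration plan describes.
interface ReadOnlyCollection<E> extends Iterable<E> {
    int size();
    Iterator<E> iterator(E start); // iterate from the first element >= start
}

// Minimal sorted-backing implementation, for illustration only.
class SortedChildren<E extends Comparable<E>> implements ReadOnlyCollection<E> {
    private final List<E> sorted; // must be sorted, like directory children

    SortedChildren(List<E> sorted) { this.sorted = sorted; }

    @Override public int size() { return sorted.size(); }
    @Override public Iterator<E> iterator() { return sorted.iterator(); }

    @Override public Iterator<E> iterator(E start) {
        int i = Collections.binarySearch(sorted, start);
        if (i < 0) i = -i - 1; // insertion point: first element >= start
        return sorted.subList(i, sorted.size()).iterator();
    }
}
```

For example, over children ["a", "b", "d"], iterator("c") would yield only "d", which matches the "iterate starting from a certain child" use case the comment describes.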