[jira] [Comment Edited] (HDFS-12051) Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly those denoting file/directory names) to save memory
[ https://issues.apache.org/jira/browse/HDFS-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340176#comment-16340176 ]

Misha Dmitriev edited comment on HDFS-12051 at 1/25/18 10:29 PM:
-----------------------------------------------------------------

Thank you for the review, [~manojg]. See my responses inline below.

{{NameCache.java}}

* _line 97: {{cache = new byte[cacheSize][];}} Since this will take up contiguous space, we need to restrict the cache size to a much smaller size than your current MAX size of 1 << 30. Your thoughts?_

As you can see from line 78, the cache size is always set from the configuration, which provides a reasonable default that is much, much smaller than 1 << 30. It's up to the customer to increase this value if they need to. If they have a huge heap, like 120GB (I've already heard of users approaching this!), then 1 << 30 results in an 8GB contiguous array (2^30 slots times 8 bytes per reference). Against a heap that large, there is nothing wrong with that, if they really need it. In any case, if they decide to change this number, I think it's reasonable to expect them to have some understanding of what they are doing.

* _{{#cache}} is now following the {{open addressing}} model. Any reasons why you moved to this model compared to your initial design?_

My own design for this cache has always been open addressing. The reason is that this is the most economical model in terms of memory: it uses just one pointer per cache entry, which is 8 bytes at most. If you use a cache with collision chains, like java.util.HashMap, then each entry requires a pointer plus a separate object (HashMap$Entry). This separate object takes at least 32 bytes, so you end up with at least 40 bytes per entry - five times more! For a general-purpose HashMap, which may need to hold a very large number of objects and must keep all of them, the collision-chain design may be justified in some cases. But for our specialized fixed-size cache, which strives to minimize its own memory overhead, the open addressing design is more appropriate. (See the sketch after this comment.)

* _{{#put()}}_
** _line 118: the first time cache fill .. shouldn't it be a new byte array name constructed from the passed in name? Why use the same caller passed in name?_

The goal of this cache is to _avoid_ object duplication as much as possible. If the caller gave us a name X for which we don't have an existing copy, we just remember X and return it. If on the next invocation they give us Y and it turns out to be equal to X, we return X again, and Y is effectively lost and GCed soon.

** _With the {{open addressing}} model, when you overwrite the cache slot with new names, there could be INodes which are already referring to this name and are cut from the cache. Though their references are still valid, want to understand why the preference is given to new names compared to the old ones._

The preference is given to new names simply because it's the lesser evil. We already discussed this with [~yzhangal] in the past. First, obviously, when a cache entry is overwritten, the old INodes just continue to refer to their old names, i.e. no information is lost. Second, all our design details stem from the fact that we don't know in advance how many names we are going to have and how many of them will be duplicates. Thus we want a fixed-size cache that is guaranteed not to waste much memory if there is little duplication, but that provides a benefit and saves a lot of memory if there is considerable duplication. Now, suppose we have a cache of size 3, and names come as follows: 'A', 'B', 'C', 'D', 'D', 'D', ...
The cache would be full after the first 3 names. If after that we don't overwrite one of the entries to accommodate 'D', we will not realize any savings from deduplicating all the subsequent 'D's. To be fair, if this cache receives something like 'A', 'B', 'C', 'D', 'E', 'F', 'A', 'B', 'C', 'D', 'E', 'F' - then it just gets overwritten all the time and provides no benefit. But in practice (and I have already implemented a similar cache in several other projects), I've never observed such a pathology. With a reasonably sized cache and real-life data, it always works.

* _I don't see any cache invalidation even when the INodes are removed. This takes up memory. Though not huge, design-wise it's not clean to leave the cache with stale values and incur a cache lookup penalty in future put()s._

This cache by default takes just 16MB, which is 0.1% of a 16GB heap, and 16GB is on the smaller side of the NN heap size spectrum. So any losses due to stale cache entries are pretty negligible. Furthermore, the above-mentioned overwriting of cache entries when new data comes in also helps to keep the cache reasonably "fresh".

* _{{#getSize()}} since there is no cache invalidation, and since this is an open addressing model, the size returned is not right._

As the javadoc for this method explains, this method may return
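Since much of the discussion above concerns how the open-addressing cache behaves on {{put()}}, here is a minimal, hedged sketch of such a fixed-size byte[] interning cache. The class, field, and method names are illustrative and not necessarily those used in the patch; the sketch only shows the one-reference-per-slot layout and the "newest name wins" overwrite policy described above.

{code:java}
/**
 * Illustrative sketch only: a fixed-size, one-slot-per-bucket interning cache
 * for byte[] names, as discussed in the comment above. Not the actual patch code.
 */
class NameInterningCacheSketch {
  private final byte[][] cache;   // one array reference per entry, no Entry objects
  private final int mask;         // cache length is a power of two, so index = hash & mask

  NameInterningCacheSketch(int cacheSize) {
    // cacheSize is assumed to already be a power of two
    cache = new byte[cacheSize][];
    mask = cacheSize - 1;
  }

  /**
   * Returns the canonical copy of 'name': either a previously cached equal array,
   * or 'name' itself, which then overwrites whatever currently occupies the slot.
   */
  byte[] put(byte[] name) {
    int slot = hash(name) & mask;
    byte[] cached = cache[slot];
    if (cached != null && java.util.Arrays.equals(cached, name)) {
      return cached;            // duplicate: the caller's copy becomes garbage and is GCed
    }
    cache[slot] = name;         // empty slot or collision: the newer name wins (overwrite)
    return name;
  }

  private static int hash(byte[] name) {
    return java.util.Arrays.hashCode(name);
  }
}
{code}

With this layout each entry costs a single array reference, versus a reference plus a separate ~32-byte entry object in a chained table, which is the roughly five-fold difference mentioned above.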
[jira] [Comment Edited] (HDFS-12051) Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly those denoting file/directory names) to save memory
[ https://issues.apache.org/jira/browse/HDFS-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338473#comment-16338473 ]

Manoj Govindassamy edited comment on HDFS-12051 at 1/25/18 12:23 AM:
---------------------------------------------------------------------

Thanks for working on this [~mi...@cloudera.com]. A few comments on HDFS-12051.07.patch:

{{NameCache.java}}
* line 97: {{cache = new byte[cacheSize][];}} Since this will take up contiguous space, we need to restrict the cache size to a much smaller size than your current MAX size of 1 << 30. Your thoughts?
* {{#cache}} is now following the {{open addressing}} model. Any reasons why you moved to this model compared to your initial design?
* {{#put()}}
** line 118: the first time cache fill .. shouldn't it be a new byte array name constructed from the passed in name? Why use the same caller passed in name?
** With the {{open addressing}} model, when you overwrite the cache slot with new names, there could be INodes which are already referring to this name and are cut from the cache. Though their references are still valid, want to understand why the preference is given to new names compared to the old ones.
* I don't see any cache invalidation even when the INodes are removed. This takes up memory. Though not huge, design-wise it's not clean to leave the cache with stale values and incur a cache lookup penalty in future put()s.
* {{#getSize()}} since there is no cache invalidation, and since this is an open addressing model, the size returned is not right.
* line 149: {{cacheSizeFor}} is this roundUp or roundDown to the nearest power of 2? Please add a link to {{HashMap#tableSizeFor()}} in the comment to show where the code is inspired from.
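For the {{cacheSizeFor}} question above (round up vs. round down to a power of two), here is a hedged sketch of the {{HashMap#tableSizeFor()}}-style bit twiddling that rounds a requested size *up* to the next power of two. The class, method, and constant names are illustrative; whether the patch itself rounds up or down is exactly what the reviewer is asking to have documented.

{code:java}
// Illustrative sketch only: round a requested size UP to the next power of two,
// clamped to a maximum, in the style of java.util.HashMap#tableSizeFor (JDK 8).
// CacheSizing, MAX_CACHE_SIZE and cacheSizeFor are hypothetical names.
final class CacheSizing {
  static final int MAX_CACHE_SIZE = 1 << 30;

  static int cacheSizeFor(int requested) {
    int n = requested - 1;        // subtract 1 so exact powers of two are preserved
    n |= n >>> 1;
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;                // n is now all-ones below the highest set bit
    return (n < 0) ? 1 : (n >= MAX_CACHE_SIZE) ? MAX_CACHE_SIZE : n + 1;
  }
}
// Examples: cacheSizeFor(4 * 1024 * 1024) == 4194304 (already a power of two),
//           cacheSizeFor(5000000)        == 8388608 (rounded up).
{code}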
> Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly
> those denoting file/directory names) to save memory
> --------------------------------------------------------------------------
>
>                 Key: HDFS-12051
>                 URL: https://issues.apache.org/jira/browse/HDFS-12051
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>            Priority: Major
>         Attachments: HDFS-12051.01.patch, HDFS-12051.02.patch,
>                      HDFS-12051.03.patch, HDFS-12051.04.patch, HDFS-12051.05.patch,
>                      HDFS-12051.06.patch, HDFS-12051.07.patch
>
>
> When a snapshot diff operation is performed in a NameNode that manages several
> million HDFS files/directories, the NN needs a lot of memory. Analyzing one heap
> dump with jxray (www.jxray.com), we observed that duplicate byte[] arrays
> result in 6.5% memory overhead, and most of these arrays are referenced by
> {{org.apache.hadoop.hdfs.server.namenode.INodeFileAttributes$SnapshotCopy.name}}
> and {{org.apache.hadoop.hdfs.server.namenode.INodeFile.name}}:
> {code:java}
> 19. DUPLICATE PRIMITIVE ARRAYS
> Types of duplicate objects:
>      Ovhd              Num objs    Num unique objs   Class name
> 3,220,272K (6.5%)      104749528   25760871          byte[]
>
> 1,841,485K (3.7%), 53194037 dup arrays (13158094 unique)
> 3510556 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 2228255
> of byte[8](48, 48, 48, 48, 48, 48, 95, 48), 357439 of byte[17](112, 97, 114,
> 116, 45, 109, 45, 48, 48, 48, ...), 237395 of byte[8](48, 48, 48, 48, 48, 49,
> 95, 48), 227853 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...),
> 179193 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48,
[jira] [Comment Edited] (HDFS-12051) Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly those denoting file/directory names) to save memory
[ https://issues.apache.org/jira/browse/HDFS-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336383#comment-16336383 ]

Yongjun Zhang edited comment on HDFS-12051 at 1/23/18 9:32 PM:
---------------------------------------------------------------

Thanks for your comments [~szetszwo]. I thought you had reviewed the patch.

Misha described his tests at:
https://issues.apache.org/jira/browse/HDFS-12051?focusedCommentId=16084471&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16084471
and
https://issues.apache.org/jira/browse/HDFS-12051?focusedCommentId=16329891&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16329891

I agree more testing is always a good thing.

I did another round of review and found one thing I did not point out earlier (I'm sorry about that). The patch obsoleted the old config
{code}
public static final String DFS_NAMENODE_NAME_CACHE_THRESHOLD_KEY =
    "dfs.namenode.name.cache.threshold";
public static final int DFS_NAMENODE_NAME_CACHE_THRESHOLD_DEFAULT = 10;
{code}
and introduced the new config
{code}
public static final String DFS_NAMENODE_NAME_CACHE_SIZE_KEY =
    "dfs.namenode.name.cache.size";
public static final int DFS_NAMENODE_NAME_CACHE_SIZE_DEFAULT = 4 * 1024 * 1024;
{code}
We should make this visible to users, stating that the former is deprecated because of the implementation change and that the new one can be used to configure the new implementation. Would you please address this, [~mi...@cloudera.com]?

Since I'm the only one who has reviewed the patch, I'm inviting [~manojg] for a review (thanks Manoj). While he is reviewing, other folks are welcome to review too. Thanks.
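One possible way to make the config change visible to users, as requested above, is to keep the old constant, mark it {{@Deprecated}}, and point its javadoc at the new key. This is only a hedged sketch (the class name and javadoc text are illustrative and not taken from the patch); the same information could equally be surfaced via hdfs-default.xml, the release note, or Hadoop's {{Configuration.addDeprecation()}} mechanism.

{code:java}
// Illustrative sketch only, not the actual DFSConfigKeys diff.
public final class DFSConfigKeysSketch {
  /**
   * @deprecated No longer used: the NameCache implementation changed. Configure the
   * new fixed-size cache with {@link #DFS_NAMENODE_NAME_CACHE_SIZE_KEY} instead.
   */
  @Deprecated
  public static final String DFS_NAMENODE_NAME_CACHE_THRESHOLD_KEY =
      "dfs.namenode.name.cache.threshold";

  /** Size of the new fixed-size name cache introduced by this change. */
  public static final String DFS_NAMENODE_NAME_CACHE_SIZE_KEY =
      "dfs.namenode.name.cache.size";
  public static final int DFS_NAMENODE_NAME_CACHE_SIZE_DEFAULT = 4 * 1024 * 1024;
}
{code}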