[jira] [Comment Edited] (HDFS-12051) Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly those denoting file/directory names) to save memory

2018-01-25 Thread Misha Dmitriev (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340176#comment-16340176
 ] 

Misha Dmitriev edited comment on HDFS-12051 at 1/25/18 10:29 PM:
-

Thank you for the review, [~manojg]. See my responses inline below.

{{NameCache.java}}
 * _line 97: {{cache = new byte[cacheSize][];}} Since this will take up contiguous space, we need to restrict the cache size to something much smaller than your current MAX size of 1 << 30. Your thoughts?_

As you can see from line 78, the cache size is always set from the configuration, which provides a reasonable default that is much, much smaller than 1 << 30. It's up to the customer to increase this value if they need to. If they have a huge heap, like 120GB (I've already heard of users approaching this!), then 1 << 30 entries would result in an 8GB contiguous array. With such a huge heap, there is nothing wrong with that, if they really need it. And anyway, if they decide to change this number, I think it's reasonable to expect them to have some understanding of what they are doing.

_{{#cache}} is now following the {{open addressing}} model. Any reasons why you 
moved to this model compared to your initial design?_

My own design for this cache has always been open addressing. The reason is that this is the most economical model in terms of memory: it uses just one pointer per cache entry, which is 8 bytes at most. If you use a cache with collision chains, like java.util.HashMap, then each entry requires a pointer plus a separate object (HashMap$Entry). This separate object takes at least 32 bytes, so you end up with at least 40 bytes per entry, five times more!

Now, for a real HashMap, which needs to hold a potentially very large number of objects, and needs to hold all of them, the collision-chain design may be justified in some cases. But for our specialized fixed-size cache, which strives to minimize its own memory overhead, the open addressing design is more appropriate.
 * _{{#put()}}_ 
 ** _line 118: the first time the cache is filled, shouldn't it be a new byte array constructed from the passed-in name? Why use the caller's passed-in array directly?_

The goal of this cache is to _avoid_ object duplication as much as possible. If the caller gives us a name X for which we don't have an existing copy, we just remember X and return it. If on the next invocation they give us Y and it turns out to be equal to X, we return X again, and Y is effectively dropped and will be GCed soon.
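To make the {{put()}} semantics concrete, here is a minimal, self-contained sketch of such an open-addressing intern cache. This is an illustration of the idea only, not the actual patch code; the class name and all details are assumptions:

```java
import java.util.Arrays;

/**
 * Sketch of a fixed-size, open-addressing intern cache for byte[] names.
 * One reference per slot (8 bytes at most), no per-entry objects.
 * Illustrative only; not the actual HDFS-12051 implementation.
 */
public class NameCacheSketch {
  private final byte[][] cache; // one reference per slot
  private final int mask;       // cache size must be a power of 2

  public NameCacheSketch(int cacheSize) {
    if (Integer.bitCount(cacheSize) != 1) {
      throw new IllegalArgumentException("cacheSize must be a power of 2");
    }
    cache = new byte[cacheSize][];
    mask = cacheSize - 1;
  }

  /**
   * Returns the canonical copy of name. An empty slot simply remembers
   * the caller's array; if the slot holds an equal array, that existing
   * array is returned and the caller's duplicate becomes garbage; on a
   * collision with a different name, the new name overwrites the slot
   * (old referents keep their arrays, so nothing is lost).
   */
  public byte[] put(byte[] name) {
    int slot = (Arrays.hashCode(name) & 0x7fffffff) & mask;
    byte[] cached = cache[slot];
    if (cached != null && Arrays.equals(cached, name)) {
      return cached; // deduplicated: reuse the existing copy
    }
    cache[slot] = name; // first fill, or overwrite on collision
    return name;
  }
}
```

With this shape, interning 'A', 'B', 'C' and then repeated 'D's behaves as in the example discussed further below: a 'D' eventually overwrites one slot, after which every subsequent equal 'D' is served from the cache.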

 
 * _With the {{open addressing}} model, when you overwrite a cache slot with a new name, there could be INodes already referring to the old name that are now cut off from the cache. Though their references are still valid, I want to understand why preference is given to new names over the old ones._

The preference is given to new names simply because it's the lesser evil. We already discussed this with [~yzhangal] in the past. First, obviously, when a cache entry is overwritten, the old INodes just continue to refer to their old names, i.e. no information is lost. Second, all our design decisions stem from the fact that we don't know in advance how many names we are going to have, or how many of them will be duplicates. Thus we want a fixed-size cache that is guaranteed not to waste much memory if there is little duplication, but will save a lot of memory if there is considerable duplication.

Now, suppose we have a cache of size 3, and names come as follows: 'A', 'B', 'C', 'D', 'D', 'D', ... The cache would be full after the first 3 names. If after that we don't overwrite one of the entries to accommodate 'D', we will not realize any savings from deduplicating all the subsequent 'D's. To be fair, if this cache receives something like 'A', 'B', 'C', 'D', 'E', 'F', 'A', 'B', 'C', 'D', 'E', 'F', then it just gets rewritten all the time and provides no benefit. But in practice (and I have already implemented a similar cache in several other projects), I've never observed such a pathology. With a reasonably sized cache and real-life data, it always works.
 * _I don't see any cache invalidation even when INodes are removed. This takes up memory. Though not huge, design-wise it's not clean to leave the cache with stale values and incur a cache lookup penalty in future put() calls._

This cache by default takes just 16MB, which is 0.1% of 16GB, the smaller end of the NN heap size spectrum. So any losses due to stale cache entries are pretty negligible. Furthermore, the above-mentioned overwriting of cache entries as new data comes in also helps to keep the cache reasonably "fresh".
 * _{{#getSize()}} since there is no cache invalidation, and given this open addressing model, the size returned is not right._

As the javadoc for this method explains, this method may return an approximate value.
[jira] [Comment Edited] (HDFS-12051) Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly those denoting file/directory names) to save memory

2018-01-24 Thread Manoj Govindassamy (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338473#comment-16338473
 ] 

Manoj Govindassamy edited comment on HDFS-12051 at 1/25/18 12:23 AM:
-

Thanks for working on this [~mi...@cloudera.com]. A few comments on HDFS-12051.07.patch

{{NameCache.java}}
 * line 97: {{cache = new byte[cacheSize][];}} Since this will take up contiguous space, we need to restrict the cache size to something much smaller than your current MAX size of 1 << 30. Your thoughts?
 * {{#cache}} is now following the {{open addressing}} model. Any reasons why 
you moved to this model compared to your initial design?
 * {{#put()}} 
 ** line 118: the first time the cache is filled, shouldn't it be a new byte array constructed from the passed-in name? Why use the caller's passed-in array directly?
 ** With the {{open addressing}} model, when you overwrite a cache slot with a new name, there could be INodes already referring to the old name that are now cut off from the cache. Though their references are still valid, I want to understand why preference is given to new names over the old ones.
 * I don't see any cache invalidation even when INodes are removed. This takes up memory. Though not huge, design-wise it's not clean to leave the cache with stale values and incur a cache lookup penalty in future put() calls.
 * {{#getSize()}} since there is no cache invalidation, and given this open addressing model, the size returned is not right.
 * line 149: {{cacheSizeFor}} is this roundUp or roundDown to the nearest power of 2? Please add a link to {{HashMap#tableSizeFor()}} in the comment to show where the code is inspired from.
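For reference, {{HashMap#tableSizeFor()}} rounds *up* to the nearest power of two with a bit-smearing idiom. Here is a self-contained sketch of that idiom (the clamping constant mirrors HashMap's 1 << 30 maximum; this is not the patch's actual {{cacheSizeFor}} code):

```java
/**
 * Rounds cap up to the nearest power of two, using the same bit-smearing
 * idiom as java.util.HashMap#tableSizeFor. Sketch for illustration only;
 * not the patch's actual cacheSizeFor implementation.
 */
public class TableSize {
  static final int MAX = 1 << 30; // largest power-of-2 table size

  static int sizeFor(int cap) {
    int n = cap - 1;      // so exact powers of two map to themselves
    n |= n >>> 1;         // smear the highest set bit downwards...
    n |= n >>> 2;
    n |= n >>> 4;
    n |= n >>> 8;
    n |= n >>> 16;        // ...until all lower bits are set
    return (n < 0) ? 1 : (n >= MAX) ? MAX : n + 1;
  }
}
```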



> Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly 
> those denoting file/directory names) to save memory
> -
>
> Key: HDFS-12051
> URL: https://issues.apache.org/jira/browse/HDFS-12051
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Misha Dmitriev
>Assignee: Misha Dmitriev
>Priority: Major
> Attachments: HDFS-12051.01.patch, HDFS-12051.02.patch, 
> HDFS-12051.03.patch, HDFS-12051.04.patch, HDFS-12051.05.patch, 
> HDFS-12051.06.patch, HDFS-12051.07.patch
>
>
> When snapshot diff operation is performed in a NameNode that manages several 
> million HDFS files/directories, NN needs a lot of memory. Analyzing one heap 
> dump with jxray (www.jxray.com), we observed that duplicate byte[] arrays 
> result in 6.5% memory overhead, and most of these arrays are referenced by 
> {{org.apache.hadoop.hdfs.server.namenode.INodeFileAttributes$SnapshotCopy.name}}
>  and {{org.apache.hadoop.hdfs.server.namenode.INodeFile.name}}:
> {code:java}
> 19. DUPLICATE PRIMITIVE ARRAYS
> Types of duplicate objects:
>  Ovhd Num objs  Num unique objs   Class name
> 3,220,272K (6.5%)   104749528  25760871 byte[]
> 
>   1,841,485K (3.7%), 53194037 dup arrays (13158094 unique)
> 3510556 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 2228255 
> of byte[8](48, 48, 48, 48, 48, 48, 95, 48), 357439 of byte[17](112, 97, 114, 
> 116, 45, 109, 45, 48, 48, 48, ...), 237395 of byte[8](48, 48, 48, 48, 48, 49, 
> 95, 48), 227853 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48, 48, ...), 
> 179193 of byte[17](112, 97, 114, 116, 45, 109, 45, 48, 48,

[jira] [Comment Edited] (HDFS-12051) Reimplement NameCache in NameNode: Intern duplicate byte[] arrays (mainly those denoting file/directory names) to save memory

2018-01-23 Thread Yongjun Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336383#comment-16336383
 ] 

Yongjun Zhang edited comment on HDFS-12051 at 1/23/18 9:32 PM:
---

Thanks for your comments [~szetszwo]. I thought you had reviewed the patch. 

Misha described his tests at:

https://issues.apache.org/jira/browse/HDFS-12051?focusedCommentId=16084471&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16084471

and

https://issues.apache.org/jira/browse/HDFS-12051?focusedCommentId=16329891&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16329891

I agree more testing is always a good thing.

I did another round of review and found one thing I did not point out earlier 
(I'm sorry about that). The patch obsoleted the old config 
{code}
public static final String DFS_NAMENODE_NAME_CACHE_THRESHOLD_KEY = 
"dfs.namenode.name.cache.threshold";
public static final int DFS_NAMENODE_NAME_CACHE_THRESHOLD_DEFAULT = 10;
{code}
and introduced a new config 
{code}
public static final String  DFS_NAMENODE_NAME_CACHE_SIZE_KEY = 
"dfs.namenode.name.cache.size";
public static final int DFS_NAMENODE_NAME_CACHE_SIZE_DEFAULT = 4 * 1024 * 
1024;
{code}
We should make this visible to users, noting that the former is deprecated because of the implementation change, and that the new key can be used to configure the new implementation.
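For users, that would amount to a release-notes entry plus, where needed, an explicit override of the new key in hdfs-site.xml. A hypothetical example, using the key name and default value taken from the patch excerpt above:

```xml
<!-- dfs.namenode.name.cache.threshold is obsolete with the new
     implementation; the cache capacity is now configured directly.
     4194304 (4 * 1024 * 1024) is the default from the patch. -->
<property>
  <name>dfs.namenode.name.cache.size</name>
  <value>4194304</value>
</property>
```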

Would you please address this [~mi...@cloudera.com]?

Since I'm the only one who has reviewed the patch, I'm inviting [~manojg] for a 
review (thanks Manoj). While he is reviewing, other folks are welcome to review.

Thanks.



