Re: Review Request 56687: Intern strings in various critical places to reduce memory consumption.

Misha Dmitriev Fri, 24 Feb 2017 10:47:00 -0800


> On Feb. 24, 2017, 7:38 a.m., Rui Li wrote:
> > common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java, line 
> > 67
> > <https://reviews.apache.org/r/56687/diff/2/?file=1642999#file1642999line67>
> >
> >     why the stringField doesn't need the null check like other fields?


There are strong indicators in the URI.java code that this field is never null. 
For example, the javadoc says it's the only serializable field. However, it's 
hard to prove this with 100% accuracy, and it might change in the future. So, 
added a null check for safety.


> On Feb. 24, 2017, 7:38 a.m., Rui Li wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java, line 3147
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643000#file1643000line3147>
> >
> >     How about intern the path in the createEmptyFile method?

Makes sense, done.


> On Feb. 24, 2017, 7:38 a.m., Rui Li wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java, line 183
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643005#file1643005line183>
> >
> >     can we call the util method?

Done.


> On Feb. 24, 2017, 7:38 a.m., Rui Li wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java, line 188
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643005#file1643005line188>
> >
> >     guess we can also add a util method for this

Good idea, done.


> On Feb. 24, 2017, 7:38 a.m., Rui Li wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java, line 
> > 253
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643008#file1643008line253>
> >
> >     since we'll intern strings in the new path, do we have to intern 
> > taskTmpDir here?

We need to - you can see that this String object is also added to several 
tables below on its own, not as part of taskTmpDirPath.


> On Feb. 24, 2017, 7:38 a.m., Rui Li wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java,
> >  line 322
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643011#file1643011line322>
> >
> >     will this cause the hash map to resize since the default load factor is 
> > 0.75? and several similar concerns below

You are probably right, in that this constructor's parameter is the initial 
capacity of this table (more or less the size of the internal array) - not how 
many elements the table is expected to hold. However, if you check the code of 
HashMap, the things are more interesting. The actual capacity of the table is 
always a power of two, so unless this parameter is also a power of two, the 
capacity will be chosen as the nearest higher power of two, i.e. it will be 
higher than the parameter and closer to what we actually need. Also, if we 
create a table with the default size (16) here and then will put many more 
elements into it, it will be resized several times, whereas with the current 
code it will be resized at most once. Trying to "factor in" the load factor 
will likely add more confusion/complexity. All in all, given that choosing 
capacity in HashMap internally is non-trivial, I think it's easier/safer to 
just call 'new HashMap(oldMap.size())' as we do now.


- Misha


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56687/#review166649
-----------------------------------------------------------


On Feb. 23, 2017, 9:01 p.m., Misha Dmitriev wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56687/
> -----------------------------------------------------------
> 
> (Updated Feb. 23, 2017, 9:01 p.m.)
> 
> 
> Review request for hive, Chaoyu Tang, Mohit Sabharwal, and Sergio Pena.
> 
> 
> Bugs: https://issues.apache.org/jira/browse/HIVE-15882
>     
> https://issues.apache.org/jira/browse/https://issues.apache.org/jira/browse/HIVE-15882
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See the description of the problem in 
> https://issues.apache.org/jira/browse/HIVE-15882 Interning strings per this 
> review removes most of the overhead due to duplicate strings.
> 
> Also, where maps in several places are created from other maps, use the 
> original map's size for the new map. This is to avoid the situation when a 
> map with default capacity (typically 16) is created to hold just 2-3 entries, 
> and the rest of the internal 16-entry array is wasted.
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 
> e81cbce3e333d44a4088c10491f399e92a505293 
>   ql/src/java/org/apache/hadoop/hive/ql/hooks/Entity.java 
> 08420664d59f28f75872c25c9f8ee42577b23451 
>   ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 
> e91064b9c75e8adb2b36f21ff19ec0c1539b03b9 
>   ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 
> 51530ac16c92cc75d501bfcb573557754ba0c964 
>   ql/src/java/org/apache/hadoop/hive/ql/io/SymbolicInputFormat.java 
> 55b3b551a1dac92583b6e03b10beb8172ca93d45 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java 
> 82dc89803be9cf9e0018720eeceb90ff450bfdc8 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 
> c0edde9e92314d86482b5c46178987e79fae57fe 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 
> c6ae6f290857cfd10f1023058ede99bf4a10f057 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
> 24d16812515bdfa90b4be7a295c0388fcdfe95ef 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java
>  ede4fcbe342052ad86dadebcc49da2c0f515ea98 
>   
> ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java
>  0882ae2c6205b1636cbc92e76ef66bb70faadc76 
>   
> ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
> 68b0ad9ea63f051f16fec3652d8525f7ab07eb3f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 
> d4bdd96eaf8d179bed43b8a8c3be0d338940154a 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MsckDesc.java 
> b7a7e4b7a5f8941b080c7805d224d3885885f444 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java 
> 73981e826870139a42ad881103fdb0a2ef8433a2 
> 
> Diff: https://reviews.apache.org/r/56687/diff/
> 
> 
> Testing
> -------
> 
> I've measured how much memory this change plus another one (interning 
> Properties in PartitionDesc) save in my HS2 benchmark - the result is 37%. 
> See the details in HIVE-15882.
> 
> 
> Thanks,
> 
> Misha Dmitriev
> 
>

Re: Review Request 56687: Intern strings in various critical places to reduce memory consumption.

Reply via email to