Re: Review Request 56687: Intern strings in various critical places to reduce memory consumption.

Misha Dmitriev Fri, 24 Feb 2017 11:58:40 -0800


> On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote:
> > common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java, line 
> > 69
> > <https://reviews.apache.org/r/56687/diff/2/?file=1642999#file1642999line69>
> >
> >     Nit: please follow hive coding conventions for if statements. Same in 
> > other places. 
> > (http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-142311.html#431)


Done.


> On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java, line 57
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643005#file1643005line57>
> >
> >     any point in interning a timestamp ? likelihood of this hitting the 
> > pool is almost zero, correct ?

If I had only looked at the code, I would think the same. I don't exactly 
understand why it happens, but when I analyzed the heap dump with jxray, I saw 
several percent of the heap being wasted on duplicate strings attached to 
HiveLockObject.lockTime. In fact, all the changes in this review are based on 
these measurements - I haven't tried to guess anything.
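
To illustrate the effect being measured, here is a minimal sketch. It is not 
the code from the patch: the class InternSketch, the helper internIfNotNull 
and the timestamp value are all hypothetical, and it assumes that many lock 
objects end up holding separately allocated strings with identical contents.

```java
final class InternSketch {
    /** Null-safe wrapper around String.intern(); a hypothetical helper. */
    static String internIfNotNull(String s) {
        return s == null ? null : s.intern();
    }

    public static void main(String[] args) {
        // Two separately allocated strings with the same contents, standing in
        // for the lockTime values of two different lock objects (assumed data).
        String t1 = new String("1487966320");
        String t2 = new String("1487966320");

        // Distinct heap objects before interning.
        System.out.println(t1 == t2); // false

        // After interning, both references point at one canonical instance,
        // so the second copy can be garbage collected.
        System.out.println(internIfNotNull(t1) == internIfNotNull(t2)); // true
    }
}
```

Whether a given field is worth interning still depends on how often its values 
actually repeat, which is what the heap dump shows.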


> On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java, lines 205-220
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643006#file1643006line205>
> >
> >     Do these paths eventually get interned up the chain, or are they 
> > ignored because they aren't used/accessed in PartitionDesc? It wasn't 
> > clear to me.

That's a good question... I am not sure. I wonder if it's possible to somehow 
visualize the data flow between different parts of the code. But then I suspect 
that in too many cases it will look really complex, and it will be difficult to 
make any firm conclusions.

All I can say is that I've done all my changes via a considerable number of 
iterations. I took a heap dump after an OOM, checked where duplicate strings 
came from and "plugged" all these locations. Then I reran my benchmark and took 
another heap dump. There were fewer duplicate strings now, but some new 
locations, which caused fewer duplicates or weren't previously visible for 
other reasons, showed up. So I repeated this cycle until basically all the 
strings that I could de-dupe without changing e.g. HDFS and other library code 
were fixed. But I have used only one benchmark so far, and there is no firm 
guarantee that some other benchmark will not reveal other sources of 
duplicates. Fortunately, it looks like we do have some other benchmarks at 
Cloudera, so I am looking forward to getting and analyzing some heap dumps 
from them.


> On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java, lines 238-245
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643006#file1643006line238>
> >
> >     same for database name, table name strings accessed via 
> > MetaStoreUtils.getSchema -- getting interned someplace ?

Same answer - I've interned everything that was worth interning according to 
the measurements in my benchmark. Over-interning always poses some risk of 
creating unnecessary work for the CPU. Based on previous experience, I think 
2-3 reasonably diverse benchmarks will allow us to reveal all the important 
sources of duplication.


> On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java, lines 367-379
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643006#file1643006line367>
> >
> >     same for this.

Same as above.


> On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote:
> > ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java, line 91
> > <https://reviews.apache.org/r/56687/diff/2/?file=1643007#file1643007line91>
> >
> >     what about this ?

Same as above. Just give me more heap dumps!


- Misha


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/56687/#review166735
-----------------------------------------------------------


On Feb. 23, 2017, 9:01 p.m., Misha Dmitriev wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/56687/
> -----------------------------------------------------------
> 
> (Updated Feb. 23, 2017, 9:01 p.m.)
> 
> 
> Review request for hive, Chaoyu Tang, Mohit Sabharwal, and Sergio Pena.
> 
> 
> Bugs: https://issues.apache.org/jira/browse/HIVE-15882
> 
> 
> Repository: hive-git
> 
> 
> Description
> -------
> 
> See the description of the problem in 
> https://issues.apache.org/jira/browse/HIVE-15882. Interning strings per this 
> review removes most of the overhead due to duplicate strings.
> 
> Also, where maps in several places are created from other maps, use the 
> original map's size for the new map. This is to avoid the situation when a 
> map with default capacity (typically 16) is created to hold just 2-3 entries, 
> and the rest of the internal 16-entry array is wasted.
> 
> 
> Diffs
> -----
> 
>   common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java 
> PRE-CREATION 
>   ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 
> e81cbce3e333d44a4088c10491f399e92a505293 
>   ql/src/java/org/apache/hadoop/hive/ql/hooks/Entity.java 
> 08420664d59f28f75872c25c9f8ee42577b23451 
>   ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java 
> e91064b9c75e8adb2b36f21ff19ec0c1539b03b9 
>   ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java 
> 51530ac16c92cc75d501bfcb573557754ba0c964 
>   ql/src/java/org/apache/hadoop/hive/ql/io/SymbolicInputFormat.java 
> 55b3b551a1dac92583b6e03b10beb8172ca93d45 
>   ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java 
> 82dc89803be9cf9e0018720eeceb90ff450bfdc8 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java 
> c0edde9e92314d86482b5c46178987e79fae57fe 
>   ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java 
> c6ae6f290857cfd10f1023058ede99bf4a10f057 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java 
> 24d16812515bdfa90b4be7a295c0388fcdfe95ef 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java 
> ede4fcbe342052ad86dadebcc49da2c0f515ea98 
>   ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java 
> 0882ae2c6205b1636cbc92e76ef66bb70faadc76 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java 
> 68b0ad9ea63f051f16fec3652d8525f7ab07eb3f 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java 
> d4bdd96eaf8d179bed43b8a8c3be0d338940154a 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/MsckDesc.java 
> b7a7e4b7a5f8941b080c7805d224d3885885f444 
>   ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java 
> 73981e826870139a42ad881103fdb0a2ef8433a2 
> 
> Diff: https://reviews.apache.org/r/56687/diff/
> 
> 
> Testing
> -------
> 
> I've measured how much memory this change plus another one (interning 
> Properties in PartitionDesc) save in my HS2 benchmark - the result is 37%. 
> See the details in HIVE-15882.
> 
> 
> Thanks,
> 
> Misha Dmitriev
> 
>
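
The map-sizing point in the quoted description can be sketched as follows. 
This is only an illustration under stated assumptions: the class 
MapSizingSketch and the method copyOf are hypothetical names, not code from 
the patch, and the comment about capacity rounding describes OpenJDK's HashMap 
implementation.

```java
import java.util.HashMap;
import java.util.Map;

final class MapSizingSketch {
    /** Copies src into a new HashMap whose initial capacity is src.size(). */
    static Map<String, String> copyOf(Map<String, String> src) {
        // new HashMap<>() would start from the default capacity of 16.
        // Passing src.size() lets a 2-3 entry map get a much smaller table
        // (OpenJDK rounds the requested capacity up to a power of two).
        Map<String, String> dst = new HashMap<>(src.size());
        for (Map.Entry<String, String> e : src.entrySet()) {
            dst.put(e.getKey(), e.getValue());
        }
        return dst;
    }

    public static void main(String[] args) {
        Map<String, String> src = new HashMap<>();
        src.put("db", "default");
        src.put("table", "t1");

        Map<String, String> copy = copyOf(src);
        System.out.println(copy.equals(src)); // true
    }
}
```

The saving per map is small, so this matters mainly when a very large number 
of such small maps is alive at the same time.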
