> On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote: > > common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java, line > > 69 > > <https://reviews.apache.org/r/56687/diff/2/?file=1642999#file1642999line69> > > > > Nit: please follow hive coding conventions for if statements. Same in > > other places. > > (http://www.oracle.com/technetwork/java/javase/documentation/codeconventions-142311.html#431)
Done. > On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote: > > ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java, line 57 > > <https://reviews.apache.org/r/56687/diff/2/?file=1643005#file1643005line57> > > > > any point in interning a timestamp ? likelihood of this hitting the > > pool is almost zero, correct ? If I just looked at the code, I would think the same. I don't exactly understand why it happens, but analyzing the heap dump with jxray, I saw several per cent of the heap being wasted due to strings attached to HilveLockObject.lockTime. In fact, all the changes in this review are done based on these measurements - I haven't tried to guess anything. > On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote: > > ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java, lines 205-220 > > <https://reviews.apache.org/r/56687/diff/2/?file=1643006#file1643006line205> > > > > Do these paths eventually getting interned up the chain or are these > > ignored because these are aren't used/accessed in PartitionDesc ?...wasn't > > clear to me. That's a good question... I am not sure. I wonder if it's possible to somehow visualize the data flow between different parts of the code. But then I suspect that in too many cases it will look really complex, and it will be difficult to make any firm conclusions. All I can say is that I've done all my changes via a considerable number of iterations. I took a heap dump after an OOM, checked where duplicate strings came from and "plugged" all these locations. Then I reran my benchmark and took another heap dump. There were fewer duplicate strings now, but some new locations, that caused fewer duplicates or weren't previously visible for other reasons, showed up. So I repeated this cycle until basically all strings that I could de-dupe without changing e.g. HDFS and other library code, were fixed. But I used only one benchmark so far, and there is no firm guarantee that some other benchmark will not reveal other sources of duplicates. Fortunately, looks like we do have some other benchmarks at Cloudera, so I am looking forward to getting/analyzing some heap dumps from them. > On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote: > > ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java, lines 238-245 > > <https://reviews.apache.org/r/56687/diff/2/?file=1643006#file1643006line238> > > > > same for database name, table name strings accessed via > > MetaStoreUtils.getSchema -- getting interned someplace ? Same answer - I've interned everything that was worth interning according to the measurements in my benchmark. Over-interning always poses some risk of creating unnecessary work for the CPU. Based on the previous experience, I think 2-3 benchmarks that are reasonably diverse will allow us to reveal all the important sources of duplication. > On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote: > > ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java, lines 367-379 > > <https://reviews.apache.org/r/56687/diff/2/?file=1643006#file1643006line367> > > > > same for this. Same as above. > On Feb. 24, 2017, 6:09 p.m., Mohit Sabharwal wrote: > > ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java, line 91 > > <https://reviews.apache.org/r/56687/diff/2/?file=1643007#file1643007line91> > > > > what about this ? Same as above. Just give me more heap dumps! - Misha ----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/56687/#review166735 ----------------------------------------------------------- On Feb. 23, 2017, 9:01 p.m., Misha Dmitriev wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/56687/ > ----------------------------------------------------------- > > (Updated Feb. 23, 2017, 9:01 p.m.) > > > Review request for hive, Chaoyu Tang, Mohit Sabharwal, and Sergio Pena. > > > Bugs: https://issues.apache.org/jira/browse/HIVE-15882 > > https://issues.apache.org/jira/browse/https://issues.apache.org/jira/browse/HIVE-15882 > > > Repository: hive-git > > > Description > ------- > > See the description of the problem in > https://issues.apache.org/jira/browse/HIVE-15882 Interning strings per this > review removes most of the overhead due to duplicate strings. > > Also, where maps in several places are created from other maps, use the > original map's size for the new map. This is to avoid the situation when a > map with default capacity (typically 16) is created to hold just 2-3 entries, > and the rest of the internal 16-entry array is wasted. > > > Diffs > ----- > > common/src/java/org/apache/hadoop/hive/common/StringInternUtils.java > PRE-CREATION > ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java > e81cbce3e333d44a4088c10491f399e92a505293 > ql/src/java/org/apache/hadoop/hive/ql/hooks/Entity.java > 08420664d59f28f75872c25c9f8ee42577b23451 > ql/src/java/org/apache/hadoop/hive/ql/io/CombineHiveInputFormat.java > e91064b9c75e8adb2b36f21ff19ec0c1539b03b9 > ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java > 51530ac16c92cc75d501bfcb573557754ba0c964 > ql/src/java/org/apache/hadoop/hive/ql/io/SymbolicInputFormat.java > 55b3b551a1dac92583b6e03b10beb8172ca93d45 > ql/src/java/org/apache/hadoop/hive/ql/lockmgr/HiveLockObject.java > 82dc89803be9cf9e0018720eeceb90ff450bfdc8 > ql/src/java/org/apache/hadoop/hive/ql/metadata/Partition.java > c0edde9e92314d86482b5c46178987e79fae57fe > ql/src/java/org/apache/hadoop/hive/ql/metadata/Table.java > c6ae6f290857cfd10f1023058ede99bf4a10f057 > ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java > 24d16812515bdfa90b4be7a295c0388fcdfe95ef > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/GenMRSkewJoinProcessor.java > ede4fcbe342052ad86dadebcc49da2c0f515ea98 > > ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/NullScanTaskDispatcher.java > 0882ae2c6205b1636cbc92e76ef66bb70faadc76 > > ql/src/java/org/apache/hadoop/hive/ql/plan/ConditionalResolverMergeFiles.java > 68b0ad9ea63f051f16fec3652d8525f7ab07eb3f > ql/src/java/org/apache/hadoop/hive/ql/plan/MapWork.java > d4bdd96eaf8d179bed43b8a8c3be0d338940154a > ql/src/java/org/apache/hadoop/hive/ql/plan/MsckDesc.java > b7a7e4b7a5f8941b080c7805d224d3885885f444 > ql/src/java/org/apache/hadoop/hive/ql/plan/PartitionDesc.java > 73981e826870139a42ad881103fdb0a2ef8433a2 > > Diff: https://reviews.apache.org/r/56687/diff/ > > > Testing > ------- > > I've measured how much memory this change plus another one (interning > Properties in PartitionDesc) save in my HS2 benchmark - the result is 37%. > See the details in HIVE-15882. > > > Thanks, > > Misha Dmitriev > >