[ https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Wang updated HDFS-11383:
-------------------------------
    Summary: Intern strings in BlockLocation and ExtendedBlock  (was: String duplication in org.apache.hadoop.fs.BlockLocation)

> Intern strings in BlockLocation and ExtendedBlock
> -------------------------------------------------
>
>                 Key: HDFS-11383
>                 URL: https://issues.apache.org/jira/browse/HDFS-11383
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>             Fix For: 2.9.0, 3.0.0-alpha4
>
>         Attachments: HDFS-11383.01.patch, HDFS-11383.02.patch,
> HDFS-11383.03.patch, HDFS-11383.04.patch, hs2-crash-2.txt
>
>
> I am working on Hive performance, investigating the problem of high memory
> pressure when (a) a table consists of a large number (thousands) of partitions
> and (b) multiple queries run against it concurrently. It turns out that a lot
> of memory is wasted due to data duplication. One source of duplicate strings
> is the class org.apache.hadoop.fs.BlockLocation. Its fields, such as storageIds,
> topologyPaths, hosts, and names, may collectively use up to 6% of memory in my
> benchmark, causing (together with other problematic classes) a huge memory
> spike. Of these 6% of memory taken by BlockLocation strings, more than 5% are
> wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation
> constructor, like:
> {code}
> this.hosts = internStringsInArray(hosts);
> ...
> // Interns each element in place and returns the same array.
> private String[] internStringsInArray(String[] sar) {
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }
> {code}
> String.intern() performs very well starting from JDK 7. I've found some
> articles explaining the progress that was made by the HotSpot JVM developers
> in this area, verified that with benchmarks myself, and finally added quite a
> bit of interning to one of the Cloudera products without any issues.
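
A minimal, self-contained sketch of the interning idea described in the issue may help show why it removes duplication. It is not the code from the attached patches: the class name InternSketch, the null checks, and the example.com host names are assumptions made for illustration only. Only String.intern() is a real JDK call here.

```java
// Hypothetical demo class; not part of Hadoop.
final class InternSketch {

    // Replaces every non-null element with its canonical interned copy,
    // in place, and returns the same array. The null handling is an
    // assumption added for safety; the snippet in the issue has none.
    static String[] internStringsInArray(String[] sar) {
        if (sar == null) {
            return null;
        }
        for (int i = 0; i < sar.length; i++) {
            if (sar[i] != null) {
                sar[i] = sar[i].intern();
            }
        }
        return sar;
    }

    public static void main(String[] args) {
        // new String(...) forces distinct heap objects, mimicking host names
        // that were created separately for each block location.
        String[] hostsA = { new String("dn1.example.com"), new String("dn2.example.com") };
        String[] hostsB = { new String("dn1.example.com"), new String("dn2.example.com") };

        // Equal contents, but two separate objects before interning.
        System.out.println(hostsA[0] == hostsB[0]);

        internStringsInArray(hostsA);
        internStringsInArray(hostsB);

        // Both arrays now reference the same canonical String object,
        // so the duplicate copies become eligible for garbage collection.
        System.out.println(hostsA[0] == hostsB[0]);
    }
}
```

The first comparison prints false and the second prints true: after interning, equal strings held by different arrays share one object, which is the saving the issue is after when thousands of block locations repeat the same hosts and topology paths.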