[ https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022204#comment-16022204 ]
Misha Dmitriev commented on HDFS-11383:
---------------------------------------

Hi Andrew,

I understand your concerns. Unit tests could be a good solution, but the problem is that, to quantify the effect of a change like this, one would in principle need to first run some code that uses the unchanged BlockLocation and measure how much memory is consumed, then run the same code with a BlockLocation that interns its strings and measure memory again. There is also the question of how representative such a "pseudo-benchmark" would be. For example, I can easily populate some data structure with very big strings and then demonstrate that interning them would save a lot of memory. But would that resemble real-life usage patterns?

So I suspect that a benchmark would be best, but indeed it's hard to revive my test cluster right now. Maybe I can still convince you by:

- noting that String.intern() is proven to work well (I've already optimized several projects at Cloudera with its help, and there I could definitely quantify the effect of the changes; we can discuss all this offline if you would like)
- attaching the results from my old benchmark showing how much memory is wasted due to duplicate strings in BlockLocation.

I am attaching the full jxray report for one of the heap dumps that I obtained in this benchmark, and here are the most relevant excerpts:

{code}
6. DUPLICATE STRINGS

Total strings: 172,451  Unique strings: 52,360  Duplicate values: 16,158  Overhead: 14,291K (29.8%)

Top duplicate strings:
    Ovhd          Num char[]s  Num objs  Value
  1,398K (2.9%)   12791        12791     "host-10-17-101-14.coe.cloudera.com"
  1,163K (2.4%)    9926         9926     "host-10-17-101-14.coe.cloudera.com:8020"
    809K (1.7%)       6            6     "hdfs://host-10-17-101-14.coe.cloudera.com:8020/tmp/misha/misha-table-partition-1,hdf ...[length 82892]"
    465K (1.0%)    9923         9923     "hdfs"
....

7. REFERENCE CHAINS FOR DUPLICATE STRINGS

595K (1.2%), 5088 dup strings (4 unique), 5088 dup backing arrays:
1696 of "DS-aab6ab0b-0b11-489f-b209-ab2c6412934c", 1149 of "DS-d47bdaca-50c5-4475-ac08-7f07e10cd0b6", 1132 of "DS-bf6046e6-d5e9-4ac2-a1af-ff8a88ab9d85", 1111 of "DS-d2c5088c-bd69-4500-b981-502819c1307a"
<-- String[] <-- org.apache.hadoop.fs.BlockLocation.storageIds <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <-- {j.u.ArrayList} <-- Java Local@fd414328 (j.u.ArrayList)

556K (1.2%), 5088 dup strings (4 unique), 5088 dup backing arrays:
1696 of "host-10-17-101-14.coe.cloudera.com", 1149 of "host-10-17-101-15.coe.cloudera.com", 1132 of "host-10-17-101-17.coe.cloudera.com", 1111 of "host-10-17-101-16.coe.cloudera.com"
<-- String[] <-- org.apache.hadoop.fs.BlockLocation.hosts <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <-- {j.u.ArrayList} <-- Java Local@fd414328 (j.u.ArrayList)

476K (1.0%), 5088 dup strings (4 unique), 5088 dup backing arrays:
1696 of "/default/10.17.101.14:50010", 1149 of "/default/10.17.101.15:50010", 1132 of "/default/10.17.101.17:50010", 1111 of "/default/10.17.101.16:50010"
<-- String[] <-- org.apache.hadoop.fs.BlockLocation.topologyPaths <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <-- {j.u.ArrayList} <-- Java Local@fd414328 (j.u.ArrayList)

409K (0.9%), 3492 dup strings (4 unique), 3492 dup backing arrays:
1164 of "DS-aab6ab0b-0b11-489f-b209-ab2c6412934c", 788 of "DS-d47bdaca-50c5-4475-ac08-7f07e10cd0b6", 770 of "DS-bf6046e6-d5e9-4ac2-a1af-ff8a88ab9d85", 770 of "DS-d2c5088c-bd69-4500-b981-502819c1307a"
<-- String[] <-- org.apache.hadoop.fs.BlockLocation.storageIds <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <-- {j.u.ArrayList} <-- Java Local@fd67ae70 (j.u.ArrayList)

397K (0.8%), 5088 dup strings (4 unique), 5088 dup backing arrays:
1696 of "10.17.101.14:50010", 1149 of "10.17.101.15:50010", 1132 of "10.17.101.17:50010", 1111 of "10.17.101.16:50010"
<-- String[] <-- org.apache.hadoop.fs.BlockLocation.names <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <-- {j.u.ArrayList} <-- Java Local@fd414328 (j.u.ArrayList)

381K (0.8%), 3492 dup strings (4 unique), 3492 dup backing arrays:
1164 of "host-10-17-101-14.coe.cloudera.com", 788 of "host-10-17-101-15.coe.cloudera.com", 770 of "host-10-17-101-17.coe.cloudera.com", 770 of "host-10-17-101-16.coe.cloudera.com"
<-- String[] <-- org.apache.hadoop.fs.BlockLocation.hosts <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <-- {j.u.ArrayList} <-- Java Local@fd67ae70 (j.u.ArrayList)
....
{code}

> String duplication in org.apache.hadoop.fs.BlockLocation
> --------------------------------------------------------
>
>                 Key: HDFS-11383
>                 URL: https://issues.apache.org/jira/browse/HDFS-11383
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HDFS-11383.01.patch
>
>
> I am working on Hive performance, investigating the problem of high memory
> pressure when (a) a table consists of a large number (thousands) of partitions
> and (b) multiple queries run against it concurrently. It turns out that a lot
> of memory is wasted due to data duplication. One source of duplicate strings
> is class org.apache.hadoop.fs.BlockLocation. Its fields, such as storageIds,
> topologyPaths, hosts, and names, may collectively use up to 6% of memory in my
> benchmark, causing (together with other problematic classes) a huge memory
> spike. Of these 6% of memory taken by BlockLocation strings, more than 5% are
> wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation
> constructor, like:
> {code}
> this.hosts = internStringsInArray(hosts);
> ...
> private String[] internStringsInArray(String[] sar) {
>   // Intern each element in place, then return the array so that the
>   // result can be assigned to the field as shown above.
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }
> {code}
> String.intern() performs very well starting from JDK 7. I've found some
> articles explaining the progress that was made by the HotSpot JVM developers
> in this area, verified that with benchmarks myself, and finally added quite a
> bit of interning to one of the Cloudera products without any issues.
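
For illustration, here is a minimal, self-contained sketch of the interning helper proposed in the issue description, together with a small check that two separately allocated but equal strings end up sharing one instance. The class name InternDemo, the null handling, and the use of new String(...) to simulate duplicates are assumptions made for this example; they are not taken from the attached patch.

```java
public class InternDemo {

    // Hypothetical helper modelled on the snippet in the issue description.
    // It interns each element in place and returns the same array, so the
    // result can be assigned directly to a field.
    public static String[] internStringsInArray(String[] sar) {
        if (sar == null) {
            return null; // assumption: callers may pass a null array
        }
        for (int i = 0; i < sar.length; i++) {
            if (sar[i] != null) { // assumption: tolerate null entries
                sar[i] = sar[i].intern();
            }
        }
        return sar;
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal contents,
        // similar to host names created separately for each block.
        String[] hostsA = { new String("host-10-17-101-14.coe.cloudera.com") };
        String[] hostsB = { new String("host-10-17-101-14.coe.cloudera.com") };

        System.out.println(hostsA[0] == hostsB[0]); // false: two objects

        internStringsInArray(hostsA);
        internStringsInArray(hostsB);

        System.out.println(hostsA[0] == hostsB[0]); // true: one shared object
    }
}
```

The sketch only shows that equal strings collapse to a single object after interning. It does not measure heap usage, so it says nothing about how much memory would be saved under a realistic workload.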