[jira] [Commented] (HDFS-11383) String duplication in org.apache.hadoop.fs.BlockLocation

Misha Dmitriev (JIRA) Tue, 23 May 2017 19:05:31 -0700

    [ 
https://issues.apache.org/jira/browse/HDFS-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022204#comment-16022204
 ]


Misha Dmitriev commented on HDFS-11383:
---------------------------------------

Hi Andrew,

I understand your concerns. Unit tests could be a good solution, but the 
problem is that quantifying the effect of a change like this would, in 
principle, require first running some code that uses the unchanged 
BlockLocation and measuring how much memory it consumes, then running the same 
code with a BlockLocation that interns its strings and measuring memory again. 
There is also the question of how representative such a "pseudo-benchmark" 
would be: I can easily populate some data structure with very large strings 
and then demonstrate that interning them saves a lot of memory. But would that 
resemble real-life usage patterns?
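
To make the before/after comparison concrete, here is a minimal sketch of such a pseudo-benchmark. It does not touch BlockLocation at all: the class name, method names, and host names are hypothetical, and it counts distinct String objects by identity instead of measuring heap bytes, so it only illustrates the mechanism and says nothing about how representative the workload is.

```java
import java.util.IdentityHashMap;
import java.util.Map;

public final class InternComparisonSketch {

    // Hypothetical stand-in for strings held by many BlockLocation objects:
    // each element starts out as a separate String object, as if it had
    // just been deserialized, even when the content is equal.
    public static String[] buildHosts(int copies, String[] uniqueValues, boolean intern) {
        String[] result = new String[copies];
        for (int i = 0; i < copies; i++) {
            // new String(...) forces a distinct object for equal content.
            String s = new String(uniqueValues[i % uniqueValues.length]);
            result[i] = intern ? s.intern() : s;
        }
        return result;
    }

    // Counts distinct String objects by reference identity, not by equals().
    public static int countDistinctObjects(String[] strings) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String s : strings) {
            seen.put(s, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        // Hypothetical host names, chosen only for illustration.
        String[] unique = {
            "host-1.example.com", "host-2.example.com",
            "host-3.example.com", "host-4.example.com"
        };
        System.out.println("without interning: "
            + countDistinctObjects(buildHosts(1000, unique, false))); // 1000
        System.out.println("with interning:    "
            + countDistinctObjects(buildHosts(1000, unique, true)));  // 4
    }
}
```

The object count drops from one per element to one per unique value, which is the effect the proposed change relies on; turning that into bytes saved still requires a heap dump or a real workload.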

So I suspect that a real benchmark would be best, but it is indeed hard to 
revive my test cluster right now. Maybe I can still convince you by:
- pointing out that String.intern() has proven to work well (I've already 
optimized several projects at Cloudera with its help, and there I could 
definitely quantify the effect of the changes; we can discuss all this offline 
if you would like)
- attaching the results from my old benchmark, which show how much memory is 
wasted due to duplicate strings in BlockLocation. I am attaching the full jxray 
report for one of the heap dumps that I obtained in this benchmark, and here 
are the most relevant excerpts:

{code}
6. DUPLICATE STRINGS

Total strings: 172,451  Unique strings: 52,360  Duplicate values: 16,158  Overhead: 14,291K (29.8%)

Top duplicate strings:
    Ovhd         Num char[]s   Num objs   Value

  1,398K (2.9%)    12791       12791      "host-10-17-101-14.coe.cloudera.com"
  1,163K (2.4%)     9926        9926      "host-10-17-101-14.coe.cloudera.com:8020"
    809K (1.7%)        6           6      "hdfs://host-10-17-101-14.coe.cloudera.com:8020/tmp/misha/misha-table-partition-1,hdf ...[length 82892]"
    465K (1.0%)     9923        9923      "hdfs"
    ....

7. REFERENCE CHAINS FOR DUPLICATE STRINGS

  595K (1.2%), 5088 dup strings (4 unique), 5088 dup backing arrays:
    1696 of "DS-aab6ab0b-0b11-489f-b209-ab2c6412934c", 1149 of "DS-d47bdaca-50c5-4475-ac08-7f07e10cd0b6", 1132 of "DS-bf6046e6-d5e9-4ac2-a1af-ff8a88ab9d85", 1111 of "DS-d2c5088c-bd69-4500-b981-502819c1307a"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.storageIds <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java Local@fd414328 (j.u.ArrayList)

  556K (1.2%), 5088 dup strings (4 unique), 5088 dup backing arrays:
    1696 of "host-10-17-101-14.coe.cloudera.com", 1149 of "host-10-17-101-15.coe.cloudera.com", 1132 of "host-10-17-101-17.coe.cloudera.com", 1111 of "host-10-17-101-16.coe.cloudera.com"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.hosts <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java Local@fd414328 (j.u.ArrayList)

  476K (1.0%), 5088 dup strings (4 unique), 5088 dup backing arrays:
    1696 of "/default/10.17.101.14:50010", 1149 of "/default/10.17.101.15:50010", 1132 of "/default/10.17.101.17:50010", 1111 of "/default/10.17.101.16:50010"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.topologyPaths <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java Local@fd414328 (j.u.ArrayList)

  409K (0.9%), 3492 dup strings (4 unique), 3492 dup backing arrays:
    1164 of "DS-aab6ab0b-0b11-489f-b209-ab2c6412934c", 788 of "DS-d47bdaca-50c5-4475-ac08-7f07e10cd0b6", 770 of "DS-bf6046e6-d5e9-4ac2-a1af-ff8a88ab9d85", 770 of "DS-d2c5088c-bd69-4500-b981-502819c1307a"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.storageIds <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java Local@fd67ae70 (j.u.ArrayList)

  397K (0.8%), 5088 dup strings (4 unique), 5088 dup backing arrays:
    1696 of "10.17.101.14:50010", 1149 of "10.17.101.15:50010", 1132 of "10.17.101.17:50010", 1111 of "10.17.101.16:50010"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.names <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java Local@fd414328 (j.u.ArrayList)

  381K (0.8%), 3492 dup strings (4 unique), 3492 dup backing arrays:
    1164 of "host-10-17-101-14.coe.cloudera.com", 788 of "host-10-17-101-15.coe.cloudera.com", 770 of "host-10-17-101-17.coe.cloudera.com", 770 of "host-10-17-101-16.coe.cloudera.com"
     <-- String[] <-- org.apache.hadoop.fs.BlockLocation.hosts <-- org.apache.hadoop.fs.BlockLocation[] <-- org.apache.hadoop.fs.LocatedFileStatus.locations <--  {j.u.ArrayList} <-- Java Local@fd67ae70 (j.u.ArrayList)

....
{code}
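
As a rough cross-check of the per-field overhead figures in the report, the waste can be approximated from the string length and the copy counts alone. The sketch below assumes a 64-bit HotSpot JVM with compressed oops and the JDK 7/8 char[]-backed String layout (24 bytes per String object, a 16-byte array header, 8-byte alignment). Those constants are assumptions, not something jxray reports, and the class and method names are hypothetical.

```java
public final class DupStringOverhead {

    // Assumed object layout: 64-bit HotSpot, compressed oops, JDK 7/8.
    static final long STRING_OBJECT_BYTES = 24;
    static final long CHAR_ARRAY_HEADER_BYTES = 16;

    // Rounds a size up to the next multiple of 8 bytes.
    public static long alignTo8(long bytes) {
        return (bytes + 7) / 8 * 8;
    }

    // Approximate heap footprint of one String plus its backing char[].
    public static long bytesPerString(int length) {
        return STRING_OBJECT_BYTES + alignTo8(CHAR_ARRAY_HEADER_BYTES + 2L * length);
    }

    // Bytes spent on copies beyond the first one of each unique value,
    // assuming all values have the same length.
    public static long overheadBytes(int length, long totalCopies, long uniqueValues) {
        return (totalCopies - uniqueValues) * bytesPerString(length);
    }

    public static void main(String[] args) {
        int hostLength = "host-10-17-101-14.coe.cloudera.com".length(); // 34 chars
        long bytes = overheadBytes(hostLength, 5088, 4);
        System.out.println(bytes / 1024 + "K"); // prints 556K
    }
}
```

Under these assumptions the estimate for the first BlockLocation.hosts entry (5088 copies of 4 unique 34-character values) comes out at 556K, in line with the figure reported for that reference chain.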

> String duplication in org.apache.hadoop.fs.BlockLocation
> --------------------------------------------------------
>
>                 Key: HDFS-11383
>                 URL: https://issues.apache.org/jira/browse/HDFS-11383
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: HDFS-11383.01.patch
>
>
> I am working on Hive performance, investigating the problem of high memory 
> pressure when (a) a table consists of a high number (thousands) of partitions 
> and (b) multiple queries run against it concurrently. It turns out that a lot 
> of memory is wasted due to data duplication. One source of duplicate strings 
> is the class org.apache.hadoop.fs.BlockLocation. Its fields, such as 
> storageIds, topologyPaths, hosts and names, may collectively use up to 6% of 
> memory in my benchmark, causing (together with other problematic classes) a 
> huge memory spike. Of this 6% of memory taken by BlockLocation strings, more 
> than 5% is wasted due to duplication.
> I think we need to add calls to String.intern() in the BlockLocation 
> constructor, like:
> {code}
> this.hosts = internStringsInArray(hosts);
> ...
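> // Interns each element in place and returns the same array, so the
> // result can be assigned directly to the field.
> private static String[] internStringsInArray(String[] sar) {
>   for (int i = 0; i < sar.length; i++) {
>     sar[i] = sar[i].intern();
>   }
>   return sar;
> }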
> {code}
> String.intern() performs very well starting from JDK 7. I've found some 
> articles explaining the progress that was made by the HotSpot JVM developers 
> in this area, verified that with benchmarks myself, and finally added quite a 
> bit of interning to one of the Cloudera products without any issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
