[jira] [Commented] (HIVE-16166) HS2 may still waste up to 15% of memory on duplicate strings

Misha Dmitriev (JIRA) Fri, 17 Mar 2017 22:35:57 -0700

    [ https://issues.apache.org/jira/browse/HIVE-16166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931072#comment-15931072 ]


Misha Dmitriev commented on HIVE-16166:
---------------------------------------

[~spena] thank you very much for the logs. I found that the failures occur in 
my code, with a stack trace like this:

{code}
java.lang.UnsupportedOperationException
        at java.util.AbstractList.set(AbstractList.java:132)
        at java.util.AbstractList$ListItr.set(AbstractList.java:426)
        at org.apache.hadoop.hive.common.StringInternUtils.internStringsInList(StringInternUtils.java:112)
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector(ObjectInspectorFactory.java:320)
        at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector(ObjectInspectorFactory.java:312)
...
{code}

The piece of StringInternUtils.java where this is thrown looks like this:

{code}
ListIterator<String> it = list.listIterator();
while (it.hasNext()) {
  it.set(it.next().intern());
}
{code}

This is the standard way to replace elements in place in the mutable List 
implementations in the JDK, e.g. ArrayList or LinkedList. For both of them, 
listIterator() returns a ListIterator that correctly implements the set() 
operation. The top frame of the trace is AbstractList.set(), whose default 
implementation simply throws UnsupportedOperationException, so my guess is that 
the code received some List (probably not from the JDK) that extends 
AbstractList without overriding set().
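
To illustrate the failure mode, here is a minimal sketch (not Hive code). The 
ReadOnlyView class and the tryIntern() helper are hypothetical names made up 
for this example; the loop itself is the one quoted above. It assumes the 
offending list is an AbstractList subclass that only implements get() and 
size().

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

final class ListIteratorSetDemo {

  /** Hypothetical list: extends AbstractList but does not override set(). */
  static final class ReadOnlyView extends AbstractList<String> {
    private final String[] items;

    ReadOnlyView(String... items) {
      this.items = items;
    }

    @Override
    public String get(int index) {
      return items[index];
    }

    @Override
    public int size() {
      return items.length;
    }
  }

  /** The same loop as in the snippet quoted above. */
  static void internStringsInList(List<String> list) {
    ListIterator<String> it = list.listIterator();
    while (it.hasNext()) {
      it.set(it.next().intern());
    }
  }

  /** Returns true if the loop completes, false if set() is unsupported. */
  static boolean tryIntern(List<String> list) {
    try {
      internStringsInList(list);
      return true;
    } catch (UnsupportedOperationException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // ArrayList overrides set(), so the loop completes.
    System.out.println(tryIntern(new ArrayList<>(Arrays.asList("a", "b"))));
    // ReadOnlyView inherits AbstractList.set(), which throws.
    System.out.println(tryIntern(new ReadOnlyView("a", "b")));
  }
}
```

With such a list, the iterator's set() delegates to AbstractList.set(), which 
matches the two top frames of the stack trace.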

Now, I think there are two alternatives for dealing with this problem. The 
first is to find the problematic List implementation (if it's in Hive) and fix 
it. This is complicated, given that the stack trace doesn't show the concrete 
List subclass, and for some reason I cannot reproduce the problem locally. In 
any case, even if this is fixed, nothing guarantees that somebody will not 
later write another incomplete List implementation that causes the same 
problem. So probably a better solution is to just catch the 
UnsupportedOperationException in my code and return as if nothing happened. 
After all, string interning is a performance optimization that doesn't affect 
the application semantics, so if it doesn't always take effect, it's not a 
serious problem. What do you think?
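
A rough sketch of the second alternative, to make the proposal concrete. This 
is not the actual patch: the class name SafeInternSketch and the method 
signature (taking and returning the same List) are assumptions made for this 
example.

```java
import java.util.List;
import java.util.ListIterator;

final class SafeInternSketch {

  /**
   * Interns the strings of the list in place when the list supports set().
   * If it does not, the list is returned unchanged.
   * Hypothetical signature, not necessarily that of the real Hive method.
   */
  static List<String> internStringsInList(List<String> list) {
    if (list == null) {
      return null;
    }
    try {
      ListIterator<String> it = list.listIterator();
      while (it.hasNext()) {
        it.set(it.next().intern());
      }
    } catch (UnsupportedOperationException e) {
      // The list does not support in-place replacement. Interning is only
      // an optimization, so give up silently and keep the original strings.
    }
    return list;
  }
}
```

For a list that rejects set() outright, the first call throws before anything 
is modified, so the caller gets back the same contents it passed in, just 
without the memory savings.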

> HS2 may still waste up to 15% of memory on duplicate strings
> ------------------------------------------------------------
>
>                 Key: HIVE-16166
>                 URL: https://issues.apache.org/jira/browse/HIVE-16166
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: ch_2_excerpt.txt, HIVE-16166.01.patch
>
>
> A heap dump obtained from one of our users shows that 15% of memory is wasted 
> on duplicate strings, despite the recent optimizations that I made. The 
> problematic strings just come from different sources this time. See the 
> excerpt from the jxray (www.jxray.com) analysis attached.
> Adding String.intern() calls in the appropriate places reduces the overhead 
> of duplicate strings with this workload to ~6%. The remaining duplicates come 
> mostly from JDK internal and MapReduce data structures, and thus are more 
> difficult to fix.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
