[ https://issues.apache.org/jira/browse/HIVE-16166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931072#comment-15931072 ]
Misha Dmitriev commented on HIVE-16166:
---------------------------------------

[~spena] thank you very much for the logs. I found that the failures occur in my code, with a stack trace like this:

{code}
java.lang.UnsupportedOperationException
    at java.util.AbstractList.set(AbstractList.java:132)
    at java.util.AbstractList$ListItr.set(AbstractList.java:426)
    at org.apache.hadoop.hive.common.StringInternUtils.internStringsInList(StringInternUtils.java:112)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector(ObjectInspectorFactory.java:320)
    at org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory.getStandardStructObjectInspector(ObjectInspectorFactory.java:312)
    ...
{code}

The piece of StringInternUtils.java where this is thrown looks like this:

{code}
ListIterator<String> it = list.listIterator();
while (it.hasNext()) {
  // Replace each element in place with its interned copy
  it.set(it.next().intern());
}
{code}

This is the standard, official way to replace elements in any modifiable List implemented in the JDK, e.g. ArrayList or LinkedList. For both of them, listIterator() returns a ListIterator that correctly implements the set() operation. So if the code throws an exception, my guess is that it received some List (probably not from the JDK) that doesn't provide a proper ListIterator implementation.

Now, I think there are two alternatives for dealing with this problem. The first is to try to find the problematic List implementation (if it's in Hive) and fix it. This is complicated, given that the stack trace doesn't show the problematic List subclass upfront, and for some reason I cannot reproduce this problem locally. But in any case, even if this is fixed, there is no guarantee that somebody will not write another incomplete List implementation in the future and cause this problem again.

So probably a better solution is to just catch the UnsupportedOperationException in my code and return as if nothing happened.
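To make the second alternative concrete, here is a minimal sketch of what catching the exception might look like. It is not the actual Hive patch: the class name InternSketch, the boolean return value, the null checks, and the readOnlyView() helper are all assumptions made for illustration. The helper builds a list that extends AbstractList without overriding set(), which is one way to end up in AbstractList.set() as in the stack trace above.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

final class InternSketch {

  // Hypothetical variant of the interning loop, not the real Hive method.
  // Returns true if every element was replaced by its interned copy,
  // false if the list was null or rejected the replacement.
  static boolean internStringsInList(List<String> list) {
    if (list == null) {
      return false;
    }
    try {
      ListIterator<String> it = list.listIterator();
      while (it.hasNext()) {
        String s = it.next();
        if (s != null) {  // null check is an addition of this sketch
          it.set(s.intern());
        }
      }
      return true;
    } catch (UnsupportedOperationException e) {
      // The list does not support set(). Interning is only an optimization,
      // so leave the list untouched and carry on.
      return false;
    }
  }

  // Hypothetical read-only list for demonstration: it extends AbstractList
  // but does not override set(), so the inherited set() throws
  // UnsupportedOperationException.
  static List<String> readOnlyView(final String... items) {
    return new AbstractList<String>() {
      @Override public String get(int index) { return items[index]; }
      @Override public int size() { return items.length; }
    };
  }

  public static void main(String[] args) {
    List<String> mutable =
        new ArrayList<>(Arrays.asList(new String("a"), new String("b")));
    System.out.println(internStringsInList(mutable));
    System.out.println(internStringsInList(readOnlyView("a", "b")));
  }
}
```

With this sketch, an ArrayList is interned in place, while the read-only list is left unchanged and the caller never sees the exception. Whether the method should also log such cases is a separate design choice that the sketch leaves open.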
After all, string interning is a performance optimization, it doesn't affect the application semantics, so if it doesn't always work as expected, it's not a serious problem. What do you think?

> HS2 may still waste up to 15% of memory on duplicate strings
> ------------------------------------------------------------
>
>                 Key: HIVE-16166
>                 URL: https://issues.apache.org/jira/browse/HIVE-16166
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Misha Dmitriev
>            Assignee: Misha Dmitriev
>         Attachments: ch_2_excerpt.txt, HIVE-16166.01.patch
>
>
> A heap dump obtained from one of our users shows that 15% of memory is wasted
> on duplicate strings, despite the recent optimizations that I made. The
> problematic strings just come from different sources this time. See the
> excerpt from the jxray (www.jxray.com) analysis attached.
> Adding String.intern() calls in the appropriate places reduces the overhead
> of duplicate strings with this workload to ~6%. The remaining duplicates come
> mostly from JDK internal and MapReduce data structures, and thus are more
> difficult to fix.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)