[ https://issues.apache.org/jira/browse/PARQUET-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425274#comment-16425274 ]
ASF GitHub Bot commented on PARQUET-1261: ----------------------------------------- scottcarey commented on issue #92: PARQUET-1261 - Remove string interning URL: https://github.com/apache/parquet-format/pull/92#issuecomment-378547346 A way to dedupe strings that does not use String.intern() is to use a weak reference set. Basically, use a WeakHashMap, potentially wrapped via Collections.newSetFromMap and Collections.synchronizedSet. If you need concurrency without synchronization, you can't easily use a ConcurrentHashMap, though you can write a class that extends WeakReference for keys and overrides the equals and hashCode method... this is slow because you have to wrap every access in a new WeakReference (which WeakHashMap avoids) If you have Guava on the classpath: https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Interners.html Is thread-safe and based on a concurrent map. I would avoid having Guava on the classpath here because of version conflicts with user code, though perhaps shading a copy of the classes you need is fine. Sadly, you can't create something like this on your own with just Guava's MapMaker because they use a package protected method to allow using key equivalence instead of reference equality on weak keys. The simplest thing is using the JDK's WeakHashMap, and dealing with its thread safety issues. Now for my last point. The original claim that OOMs were caused by String.intern() is bogus if it is JRE 7+. Read this: http://java-performance.info/string-intern-in-java-6-7-8/ Strings that are no longer referenced are GC'd, and the claim that it is not done 'frequently' is false. The code referenced on the mailing list was to when the JVM's class unloading code removes interned strings that were referenced statically by the class and has _nothing_ to do with user-mode calls to `String.intern()`. If OOM conditions are happening, it is not caused by calls to intern(). And switching to a WeakHashMap will only increase the heap required to manage it. Increasing the StringTable size might help somewhat on the performance side (the benchmark at shiplev.net clearly shows how the performance tanks when the distinct string count exceeds the table side). So in summary: 1. The original concern seems incorrect, unless it is running on JRE 6 and has too-small of a perm gen configured. Otherwise, this should not lead to OOM since the Strings are strongly referenced anyway. However, String.intern can lead to some performance issues with under-sized string tables. 2. Doing the interning by hand with Guava's Interner or a WeakHashMap might be a win performance wise, but will likely use a bit more heap. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Parquet-format interns strings when reading filemetadata > -------------------------------------------------------- > > Key: PARQUET-1261 > URL: https://issues.apache.org/jira/browse/PARQUET-1261 > Project: Parquet > Issue Type: Bug > Affects Versions: 1.9.0 > Reporter: Robert Kruszewski > Assignee: Robert Kruszewski > Priority: Major > > Parquet-format when deserializing metadata will intern strings. References I > could find suggested that it had been done to reduce memory pressure early > on. Java (and jvm in particular) went a long way since then and interning is > generally discouraged, see > [https://shipilev.net/jvm-anatomy-park/10-string-intern/] for a good > explanation. What is more since java 8 there's string deduplication > implemented at GC level per [http://openjdk.java.net/jeps/192.] During our > usage and testing we found the interning to cause significant gc pressure for > long running applications due to bigger GC root set. > This issue proposes removing interning given it's questionable whether it > should be used in modern jvms. -- This message was sent by Atlassian JIRA (v7.6.3#76005)