[ 
https://issues.apache.org/jira/browse/PARQUET-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425274#comment-16425274
 ] 

ASF GitHub Bot commented on PARQUET-1261:
-----------------------------------------

scottcarey commented on issue #92: PARQUET-1261 - Remove string interning
URL: https://github.com/apache/parquet-format/pull/92#issuecomment-378547346
 
 
   A way to dedupe strings that does not use String.intern() is to use a weak 
reference set.
   
   Basically, use a WeakHashMap, potentially wrapped via 
Collections.newSetFromMap and Collections.synchronizedSet.   If you need 
concurrency without synchronization, you can't easily use a ConcurrentHashMap, 
though you can write a class that extends WeakReference for keys and overrides 
the equals and hashCode method... this is slow because you have to wrap every 
access in a new WeakReference (which WeakHashMap avoids)
   
   If you have Guava on the classpath:  
https://google.github.io/guava/releases/19.0/api/docs/com/google/common/collect/Interners.html
  Is thread-safe and based on a concurrent map.  I would avoid having Guava on 
the classpath here because of version conflicts with user code, though perhaps 
shading a copy of the classes you need is fine.
   
   Sadly, you can't create something like this on your own with just Guava's 
MapMaker because they use a package protected method to allow using key 
equivalence instead of reference equality on weak keys.
   
   The simplest thing is using the JDK's WeakHashMap, and dealing with its 
thread safety issues.
   
   
   Now for my last point.  The original claim that OOMs were caused by 
String.intern() is bogus if it is JRE 7+.
   
   Read this: http://java-performance.info/string-intern-in-java-6-7-8/
   
   Strings that are no longer referenced are GC'd, and the claim that it is not 
done 'frequently' is false.  The code referenced on the mailing list was to 
when the JVM's class unloading code removes interned strings that were 
referenced statically by the class and has _nothing_ to do with user-mode calls 
to `String.intern()`.
   
   If OOM conditions are happening, it is not caused by calls to intern().   
And switching to a WeakHashMap will only increase the heap required to manage 
it.   Increasing the StringTable size might help somewhat on the performance 
side (the benchmark at shiplev.net clearly shows how the performance tanks when 
the distinct string count exceeds the table side).
   
   So in summary:
   
   1.  The original concern seems incorrect, unless it is running on JRE 6 and 
has too-small of a perm gen configured.  Otherwise, this should not lead to OOM 
since the Strings are strongly referenced anyway.  However, String.intern can 
lead to some performance issues with under-sized string tables.
   2.  Doing the interning by hand with Guava's Interner or a WeakHashMap might 
be a win performance wise, but will likely use a bit more heap.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Parquet-format interns strings when reading filemetadata
> --------------------------------------------------------
>
>                 Key: PARQUET-1261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1261
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.9.0
>            Reporter: Robert Kruszewski
>            Assignee: Robert Kruszewski
>            Priority: Major
>
> Parquet-format when deserializing metadata will intern strings. References I 
> could find suggested that it had been done to reduce memory pressure early 
> on. Java (and jvm in particular) went a long way since then and interning is 
> generally discouraged, see 
> [https://shipilev.net/jvm-anatomy-park/10-string-intern/] for a good 
> explanation. What is more since java 8 there's string deduplication 
> implemented at GC level per [http://openjdk.java.net/jeps/192.] During our 
> usage and testing we found the interning to cause significant gc pressure for 
> long running applications due to bigger GC root set.
> This issue proposes removing interning given it's questionable whether it 
> should be used in modern jvms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to