[ https://issues.apache.org/jira/browse/PARQUET-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425919#comment-16425919 ]
ASF GitHub Bot commented on PARQUET-1261:
-----------------------------------------

robert3005 commented on issue #92: PARQUET-1261 - Remove string interning
URL: https://github.com/apache/parquet-format/pull/92#issuecomment-378686932

   I've dug a bit more into the JVM source code, and it's slightly more complicated than, and not exactly as, Scott is saying. String#intern does indeed end up in the StringTable in the JVM, and there's no distinction between strings interned explicitly and those the compiler/JVM interns. The problem, though, is that handling of that space is GC specific. The article that Scott links is totally accurate for default JVM settings, and the links I posted were for the CMS garbage collector. From my reading of the code, it looks like interning is really only an issue under CMS (since it's very reluctant to reclaim space from the StringTable), while ParallelGC and G1 will consider it on every collection. Additionally, interning or not, you can get a similar benefit by using `UseStringDeduplication` under G1 (the default collector from Java 9 onwards).

   I am doing some benchmarking, but it seems that switching away from string interning has the potential to help those using the CMS GC and shouldn't make a significant difference on newer JVMs. Will update the PR once I am done benchmarking.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Parquet-format interns strings when reading filemetadata
> --------------------------------------------------------
>
>                 Key: PARQUET-1261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1261
>             Project: Parquet
>          Issue Type: Bug
>   Affects Versions: 1.9.0
>            Reporter: Robert Kruszewski
>            Assignee: Robert Kruszewski
>            Priority: Major
>
> Parquet-format will intern strings when deserializing metadata. References I
> could find suggested that this was done to reduce memory pressure early
> on. Java (and the JVM in particular) has come a long way since then, and interning is
> generally discouraged; see
> [https://shipilev.net/jvm-anatomy-park/10-string-intern/] for a good
> explanation. What is more, since Java 8 there's string deduplication
> implemented at the GC level per [http://openjdk.java.net/jeps/192.] During our
> usage and testing we found interning to cause significant GC pressure for
> long-running applications due to a bigger GC root set.
> This issue proposes removing interning, given that it's questionable whether it
> should be used in modern JVMs.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
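
For readers unfamiliar with what interning changes at the object level, here is a minimal sketch. It is not taken from parquet-format: the class `InternDemo` and its methods are hypothetical names used only for illustration. It decodes the same bytes twice, as a metadata deserializer might, and compares the resulting references with and without `String#intern`.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not code from parquet-format.
public class InternDemo {

    // Builds a fresh String from raw bytes, the way a deserializer would.
    // The result is never a compile-time constant, so it is not interned by default.
    public static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Two separate decodes produce two distinct String objects.
    public static boolean sameInstanceWithoutIntern(byte[] utf8) {
        return decode(utf8) == decode(utf8);
    }

    // intern() returns the canonical instance held in the JVM's StringTable,
    // so both calls yield the same reference.
    public static boolean sameInstanceWithIntern(byte[] utf8) {
        return decode(utf8).intern() == decode(utf8).intern();
    }

    public static void main(String[] args) {
        byte[] name = "column_a".getBytes(StandardCharsets.UTF_8);
        System.out.println(sameInstanceWithoutIntern(name)); // false
        System.out.println(sameInstanceWithIntern(name));    // true
    }
}
```

The interned variant saves heap by sharing one object, at the cost of an entry in the StringTable that the collector must then manage. Running with `-XX:+UseG1GC -XX:+UseStringDeduplication` instead leaves the String objects distinct (so `==` still reports false) and lets G1 share their backing arrays without touching the StringTable.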