[ 
https://issues.apache.org/jira/browse/PARQUET-1261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16425919#comment-16425919
 ] 

ASF GitHub Bot commented on PARQUET-1261:
-----------------------------------------

robert3005 commented on issue #92: PARQUET-1261 - Remove string interning
URL: https://github.com/apache/parquet-format/pull/92#issuecomment-378686932
 
 
   I've dug a bit more into the JVM source code, and it's slightly more 
complicated than, and not exactly as, Scott describes. String#intern does indeed 
end up in the StringTable in the JVM, and there's no distinction between strings 
interned explicitly and those the compiler/JVM interns. The problem, though, is 
that handling of that space is GC-specific. The article that Scott links is 
accurate for default JVM settings, and the links I posted were for the CMS 
garbage collector. From my reading of the code, it looks like interning is 
really only an issue under CMS (since it's very reluctant to reclaim space from 
the StringTable), while ParallelGC and G1 will consider it on every collection. 
Additionally, whether or not you intern, you can get a similar benefit by 
enabling `UseStringDeduplication` under G1 (the default collector from Java 9 
onwards).
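
   To make the StringTable point concrete, here is a minimal sketch of what 
String#intern does at the language level. The class name `InternDemo`, the 
helper `fresh`, and the sample value "row_group" are made up for illustration 
and are not taken from the parquet-format code.

```java
class InternDemo {
    // Builds a new String object at runtime, so it is neither a compile-time
    // constant nor already the canonical StringTable entry.
    // (Hypothetical helper, for illustration only.)
    static String fresh(String s) {
        return new String(s.toCharArray());
    }

    public static void main(String[] args) {
        String a = fresh("row_group");
        String b = fresh("row_group");

        System.out.println(a.equals(b));               // true: same contents
        System.out.println(a == b);                    // false: two separate heap objects
        System.out.println(a.intern() == b.intern());  // true: one canonical StringTable entry
        System.out.println(a.intern() == "row_group"); // true: literals share the same table
    }
}
```

   The last line shows why there is no separate pool for explicitly interned 
strings: the literal and the result of intern() are the same object.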
   
   I am doing some benchmarking, but it seems that switching away from string 
interning has the potential to help those using the CMS GC and shouldn't make a 
significant difference on newer JVMs. I will update the PR once I am done 
benchmarking.
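
   Since the behaviour is GC-specific, it can help to confirm which collector 
a given process is actually running. A rough sketch using the standard 
java.lang.management API follows; the class name `GcInfo` and its helper 
methods are hypothetical, and the flag check only sees arguments reported by 
the RuntimeMXBean, so a false result does not prove deduplication is off.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

class GcInfo {
    // Names of the collectors registered in this JVM; the exact names
    // depend on the JVM vendor and version.
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    // True if the given JVM argument list contains the exact flag string.
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Collectors: " + collectorNames());
        System.out.println("String deduplication requested: "
                + hasFlag(jvmArgs, "-XX:+UseStringDeduplication"));
    }
}
```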

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Parquet-format interns strings when reading filemetadata
> --------------------------------------------------------
>
>                 Key: PARQUET-1261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1261
>             Project: Parquet
>          Issue Type: Bug
>    Affects Versions: 1.9.0
>            Reporter: Robert Kruszewski
>            Assignee: Robert Kruszewski
>            Priority: Major
>
> Parquet-format interns strings when deserializing metadata. References I 
> could find suggest that this was done to reduce memory pressure early 
> on. Java (and the JVM in particular) has come a long way since then, and 
> interning is generally discouraged; see 
> [https://shipilev.net/jvm-anatomy-park/10-string-intern/] for a good 
> explanation. What is more, since Java 8 there's string deduplication 
> implemented at the GC level per [http://openjdk.java.net/jeps/192]. During our 
> usage and testing we found the interning to cause significant GC pressure for 
> long-running applications due to a bigger GC root set.
> This issue proposes removing interning, given it's questionable whether it 
> should be used in modern JVMs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
