I have been pointed to https://github.com/apache/parquet-format/pull/2
which is the original PR for PARQUET-11. Looking at
http://hg.openjdk.java.net/jdk10/master/file/be620a591379/src/hotspot/share/gc/cms/concurrentMarkSweepGeneration.cpp#l2563
and
http://hg.openjdk.java.net/jdk10/master/file/be620a591379/src/hotspot/share/gc/cms/concurrentMarkSweepGeneration.cpp#l5261
it does look like interned strings are very rarely GC'ed, at least by the
CMS collector that those two links cover.
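
To make the "simple cache" idea from the quoted message below more
concrete, here is a rough sketch of what a bounded replacement for
String.intern() could look like. It is not taken from parquet-format:
StringDedupCache, dedup and maxEntries are hypothetical names, and the LRU
eviction policy is just one assumption about what "simple" would mean.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded LRU cache that deduplicates equal strings without String.intern(). */
final class StringDedupCache {
    private final Map<String, String> cache;

    StringDedupCache(final int maxEntries) {
        // accessOrder=true keeps entries in least-recently-used order,
        // so removeEldestEntry evicts the entry that was touched longest ago.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns a canonical instance equal to s; synchronized because LinkedHashMap is not thread-safe. */
    synchronized String dedup(String s) {
        if (s == null) {
            return null;
        }
        String existing = cache.get(s);
        if (existing != null) {
            return existing;
        }
        cache.put(s, s);
        return s;
    }

    /** Number of strings currently held by the cache. */
    synchronized int size() {
        return cache.size();
    }
}
```

Unlike the JVM string table, entries here are ordinary heap objects: they
are evicted once the cache is full and become collectable as soon as
nothing else references them.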

On Tue, 3 Apr 2018 at 18:45 Robert Kruszewski <dzob...@gmail.com> wrote:

> Hi parquet-dev,
>
> I wanted to start a discussion around the existence of string interning in
> the thrift protocol in parquet-format. I posted some links in
> https://issues.apache.org/jira/browse/PARQUET-1261 and, while I haven't
> done perf benchmarking, I have previously seen interned strings cause "GC
> overhead limit exceeded" exceptions. The only reference I could find for
> why this was added is a reference to
> https://issues.apache.org/jira/browse/PARQUET-11 which unfortunately
> leads to a deleted repo. I wonder if anyone remembers the exact details?
>
> If we deem string deduplication there to be necessary, we should
> investigate implementing a simple cache instead. I'd hope we can simply get
> rid of interning without much harm, but would love to hear others' opinions.
>
> Robert
>
