I have been pointed to https://github.com/apache/parquet-format/pull/2, which is the original PR for PARQUET-11. Looking at http://hg.openjdk.java.net/jdk10/master/file/be620a591379/src/hotspot/share/gc/cms/concurrentMarkSweepGeneration.cpp#l2563 and http://hg.openjdk.java.net/jdk10/master/file/be620a591379/src/hotspot/share/gc/cms/concurrentMarkSweepGeneration.cpp#l5261, it does look like interned strings are very rarely GC'ed.
On Tue, 3 Apr 2018 at 18:45 Robert Kruszewski <dzob...@gmail.com> wrote:

> Hi parquet-dev,
>
> I wanted to start a discussion around the existence of string interning in
> the Thrift protocol in parquet-format. I posted some links in
> https://issues.apache.org/jira/browse/PARQUET-1261 and, while I haven't
> done perf benchmarking, I have previously seen interned strings cause "GC
> overhead limit exceeded" exceptions. The only reference I could find for why
> this was added is a reference to
> https://issues.apache.org/jira/browse/PARQUET-11, which unfortunately
> leads to a deleted repo. I wonder if anyone remembers the exact details?
>
> If we deem string deduplication there to be necessary, we should
> investigate implementing a simple cache instead. I'd hope we can simply get
> rid of interning without much harm, but would love to hear others' opinions.
>
> Robert
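
For reference, the "simple cache" idea from the quoted message could look roughly like the sketch below. This is only an illustration under assumptions: the class name StringDedupCache and the size bound are hypothetical, nothing here is taken from parquet-format, and it is not thread-safe. It keeps canonical strings in a bounded LRU map on the ordinary heap, so entries become collectable once evicted, rather than placing them in the JVM intern table.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical bounded cache that deduplicates equal strings without String.intern(). */
final class StringDedupCache {
    private final Map<String, String> cache;

    StringDedupCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict the least recently used entry once the bound is exceeded.
                return size() > maxEntries;
            }
        };
    }

    /** Returns a canonical instance equal to s. Not thread-safe. */
    String dedup(String s) {
        if (s == null) {
            return null;
        }
        String canonical = cache.get(s);
        if (canonical == null) {
            // First time we see this value: it becomes the canonical instance.
            cache.put(s, s);
            canonical = s;
        }
        return canonical;
    }

    int size() {
        return cache.size();
    }
}
```

A reader would then call dedup(value) at the point where intern() is called today. Whether the deduplication is worth keeping at all is the open question above; the sketch only shows that the memory held by such a cache can be bounded.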