Hi parquet-dev,

I wanted to start a discussion about the string interning in the Thrift protocol in parquet-format. I posted some links in https://issues.apache.org/jira/browse/PARQUET-1261, and while I haven't done any performance benchmarking, I have previously seen interned strings cause "GC overhead limit exceeded" exceptions. The only reference I could find for why this was added is a pointer to https://issues.apache.org/jira/browse/PARQUET-11, which unfortunately leads to a deleted repo. Does anyone remember the exact details?
If we deem string deduplication there to be necessary, we should investigate implementing a simple cache instead. I'd hope we can simply get rid of the interning without much harm, but I would love to hear others' opinions.

Robert
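
P.S. To make the "simple cache" idea a bit more concrete, here is a rough sketch of what I have in mind. It is only an illustration, not a patch: the class name StringDedupCache, the dedup method, and the size limit are all made up for this mail, and I am assuming one cache per reader so that it is garbage collected together with the reader instead of living in the JVM-wide intern pool.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical bounded string deduplication cache (not part of parquet-format).
 * Evicts the least recently used entry once maxEntries is exceeded.
 * Not thread-safe; the assumption is one instance per reader.
 */
public class StringDedupCache {
    private final Map<String, String> cache;

    public StringDedupCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Called after each insertion; returning true drops the LRU entry.
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached instance equal to s, or remembers and returns s itself. */
    public String dedup(String s) {
        if (s == null) {
            return null;
        }
        String existing = cache.get(s);
        if (existing != null) {
            return existing;
        }
        cache.put(s, s);
        return s;
    }

    /** Number of distinct strings currently held. */
    public int size() {
        return cache.size();
    }
}
```

The call site would then use something like cache.dedup(value) where it calls intern() today. Since the cache is bounded and scoped to its owner, the memory it holds is capped and goes away with the reader. Whether that is worth the extra code over just dropping the interning is exactly what I'd like to hear opinions on.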