Hi parquet-dev,

I wanted to start a discussion about the string interning in the Thrift
protocol in parquet-format. I posted some links in
https://issues.apache.org/jira/browse/PARQUET-1261, and while I haven't done
any perf benchmarking, I have previously seen interned strings cause "GC
overhead limit exceeded" errors. The only reference I could find for why
this was added is
https://issues.apache.org/jira/browse/PARQUET-11, which unfortunately leads
to a deleted repo. Does anyone remember the exact details?

If we deem string deduplication there to be necessary, we should investigate
implementing a simple cache instead. I'd hope we can simply get rid of the
interning without much harm, but I would love to hear others' opinions.
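
To make the "simple cache" idea concrete, here is a minimal sketch of a
bounded, LRU-evicting deduplication cache. It is only an illustration under
my own assumptions: the class name StringDedupCache, the dedup method and the
maxEntries limit are hypothetical and do not exist in parquet-format, and the
sketch is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, NOT part of parquet-format: a bounded cache that
// returns one canonical instance per distinct string value.
final class StringDedupCache {
    private final Map<String, String> cache;

    StringDedupCache(final int maxEntries) {
        // accessOrder = true makes the map iterate from least to most
        // recently used, so the "eldest" entry is the LRU one.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                // Evict once the (assumed) size limit is exceeded.
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached instance equal to s, or caches and returns s itself.
    String dedup(String s) {
        if (s == null) {
            return null;
        }
        String existing = cache.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    int size() {
        return cache.size();
    }
}
```

Unlike String.intern(), the entries here live in an ordinary heap map whose
size is capped, and the whole cache can be dropped once a file's metadata has
been read. Whether that would actually help is something benchmarking would
have to confirm.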

Robert
