vvivekiyer opened a new issue, #12078: URL: https://github.com/apache/pinot/issues/12078
With our `OnHeapStringDictionary`, we've observed that a column with many duplicate values can waste a lot of heap. A JXray analysis of a heap dump from one use case at LinkedIn showed `OnHeapStringDictionary` using about 13GB of heap.

String interning, described in https://www.baeldung.com/string/intern, solves this problem. However, for certain high-cardinality columns (even ones with enough duplicates) interning can be counterproductive. We can address this with a fixed-size interner, as described in https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save.

I built a PoC of this change for one of our use cases and saw large savings in heap usage in the heap dump analysis taken with the change. Note that I used a size of 32M for the fixed-size interner.

I'm planning to expose a new tableIndexConfig called `onHeapDictionaryConfig` that will allow us to enable interning and control the size of the fixed-size interner.
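To make the fixed-size interner idea concrete, here is a minimal sketch in Java. It is not the PoC code from this issue and not a Pinot API: the class name `FixedSizeInterner`, the power-of-two capacity requirement, and the overwrite-on-collision policy are all assumptions chosen for illustration. The point it shows is that memory stays bounded by the array size, so a high-cardinality column cannot grow the interner without limit.

```java
// Hypothetical sketch of a fixed-size string interner (illustrative only,
// not Pinot's implementation). Each string hashes to one slot in a
// fixed-length array; a colliding string simply overwrites the slot.
final class FixedSizeInterner {
    private final String[] cache;
    private final int mask;

    // Assumption: capacity must be a power of two so a bit mask can pick the slot.
    FixedSizeInterner(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        this.cache = new String[capacity];
        this.mask = capacity - 1;
    }

    // Returns a previously seen equal instance if its slot still holds one;
    // otherwise stores and returns the given instance.
    String intern(String s) {
        if (s == null) {
            return null;
        }
        int h = s.hashCode();
        int slot = (h ^ (h >>> 16)) & mask; // mix high bits into the slot index
        String cached = cache[slot];
        if (s.equals(cached)) {
            return cached; // duplicate: reuse the cached instance
        }
        cache[slot] = s; // empty slot or collision: replace the entry
        return s;
    }

    public static void main(String[] args) {
        FixedSizeInterner interner = new FixedSizeInterner(16);
        String a = new String("pinot");
        String b = new String("pinot"); // equal content, distinct object
        String first = interner.intern(a);
        String second = interner.intern(b);
        System.out.println(first == second); // prints "true": b was deduplicated to a
    }
}
```

The sketch is unsynchronized. Under concurrent use a racing write should only cost a missed deduplication, since the method always returns a string equal to its argument, but a real implementation would need to decide that deliberately.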
