[I] Reduce Heap Usage of OnHeapStringDictionary [pinot]

via GitHub Thu, 30 Nov 2023 16:33:39 -0800


vvivekiyer opened a new issue, #12078:
URL: https://github.com/apache/pinot/issues/12078

With `OnHeapStringDictionary`, we've observed that a column with a lot of
duplicate values can waste a lot of heap.

Below is a JXray analysis of the heap dump for one use case at LinkedIn where
`OnHeapStringDictionary` uses about 13GB of heap:

![image](https://github.com/apache/pinot/assets/21298365/e428e06b-64bf-4d02-ad32-19545ec7b323)

String interning, described in https://www.baeldung.com/string/intern, solves
this problem. However, for certain high-cardinality columns (even ones with
enough duplicates), interning can be counterproductive. We can address this
with a fixed-size interner, as described in the following article:
https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save
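
As a rough illustration of the fixed-size interner idea, here is a minimal
sketch. The class name `FixedSizeInterner`, its constructor, and the
`capacity` parameter are hypothetical and are not taken from the Pinot
codebase or the linked article's code. It assumes the "size" of the interner
means the number of slots in a fixed array, and it does not address
concurrency.

```java
/**
 * Hypothetical sketch of a fixed-size string interner (not Pinot's code).
 * Memory used by the cache itself is bounded by the slot count, so
 * high-cardinality columns cannot grow it without limit.
 */
final class FixedSizeInterner {
  private final String[] cache;

  // capacity is an assumed parameter: the number of slots in the cache
  FixedSizeInterner(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    cache = new String[capacity];
  }

  /** Returns a cached equal instance if one occupies the slot, else caches s. */
  String intern(String s) {
    if (s == null) {
      return null;
    }
    // Mask the sign bit so the slot index is never negative.
    int slot = (s.hashCode() & Integer.MAX_VALUE) % cache.length;
    String cached = cache[slot];
    if (s.equals(cached)) {
      return cached; // duplicate: reuse the existing object
    }
    cache[slot] = s; // empty slot or collision: overwrite instead of growing
    return s;
  }
}
```

On a collision the older entry is simply evicted, so some duplicates may go
undetected; that is the trade-off for a bounded footprint.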

I built a PoC of this change on one of our use cases and saw huge savings in
heap usage. Below is the heap dump analysis with my PoC change. Note that I
used a size of 32M for the fixed-size interner.

![image](https://github.com/apache/pinot/assets/21298365/b740620f-6104-4808-90e4-02514d1e60bc)

I'm planning to expose a new tableIndexConfig called
`onHeapDictionaryConfig` that will allow us to enable interning and control the
size of the fixed-size interner.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
