vvivekiyer opened a new issue, #12078:
URL: https://github.com/apache/pinot/issues/12078

   With our `OnHeapStringDictionary`, we've observed that if a column has a lot 
of duplicate values, those duplicates can waste a significant amount of heap.
   
   Below is the JXray analysis of a heap dump for one use case at LinkedIn, where 
the `OnHeapStringDictionary` uses about 13GB of heap:
   
![image](https://github.com/apache/pinot/assets/21298365/e428e06b-64bf-4d02-ad32-19545ec7b323)
   
   String interning, described in https://www.baeldung.com/string/intern, solves 
this problem. However, for certain high-cardinality columns (even ones with 
many duplicates), interning can be counterproductive. We can address this with 
a fixed-size interner, as described in the following article: 
https://dzone.com/articles/duplicate-strings-how-to-get-rid-of-them-and-save
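
   To make the idea concrete, here is a minimal sketch of a fixed-size interner. It is not the PoC code: the class name `FixedSizeInterner`, the power-of-two rounding, and the hash spreading are assumptions made for illustration. The cache is a plain array, so its memory footprint is bounded by the chosen capacity no matter how many distinct strings pass through it; a colliding string simply overwrites the previous occupant of its slot.

```java
// Hypothetical sketch of a fixed-size interner; names and details are
// illustrative assumptions, not code from the Pinot codebase.
final class FixedSizeInterner {
  private final String[] cache;
  private final int mask;

  FixedSizeInterner(int capacity) {
    if (capacity <= 0 || capacity > (1 << 30)) {
      throw new IllegalArgumentException("capacity must be in [1, 2^30]");
    }
    // Round up to a power of two so a slot can be picked with a bit mask.
    int size = 1;
    while (size < capacity) {
      size <<= 1;
    }
    cache = new String[size];
    mask = size - 1;
  }

  /** Returns a canonical instance equal to s, or s itself on a cache miss. */
  String intern(String s) {
    if (s == null) {
      return null;
    }
    int h = s.hashCode();
    int slot = (h ^ (h >>> 16)) & mask; // mix the high bits into the index
    String cached = cache[slot];
    if (s.equals(cached)) {
      return cached; // hit: the caller can drop its duplicate copy
    }
    cache[slot] = s; // miss or collision: overwrite, the cache never grows
    return s;
  }

  public static void main(String[] args) {
    FixedSizeInterner interner = new FixedSizeInterner(1024);
    String first = new String("us-west-2");
    String second = new String("us-west-2");
    System.out.println(first == second); // false: two separate heap objects
    System.out.println(interner.intern(first) == interner.intern(second)); // true
  }
}
```

   This sketch is not synchronized. Because `String` is immutable, a race between threads can at worst cause a missed deduplication, never a wrong value. Unlike `String.intern()` or an unbounded map, a high-cardinality column cannot make the cache itself grow; it only lowers the hit rate.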
   
   
   I built a PoC of this change for one of our use cases and saw large savings 
in heap usage. Below is the heap dump analysis with my PoC change. Note that I 
used a size of 32M for the fixed-size interner. 
   
![image](https://github.com/apache/pinot/assets/21298365/b740620f-6104-4808-90e4-02514d1e60bc)
   
   
   I'm planning to expose a new tableIndexConfig called 
`onHeapDictionaryConfig` that will allow us to enable interning and control the 
size of the fixed-size interner. 
   



