gortiz commented on issue #12078: URL: https://github.com/apache/pinot/issues/12078#issuecomment-1884447441
BTW, this issue is focused on the memory impact of the dictionary, but there is another theoretical improvement here. The solution proposed in [#12223](https://github.com/apache/pinot/pull/12223) has the side effect that two equal string literals that belong to the same column in different segments will _probably_ be resolved to the same Java String object.

When working with ClickBench, I've seen that we waste a lot of time evaluating `equals` between String objects that are equal but not the same instance, when these Strings are used as aggregation keys. With this PR, two equal String values read from different segments may turn out to be the same Java String object, which means that `equals` can be evaluated in constant time (a reference check) instead of linear time (comparing all bytes).

We should verify whether this theoretical reasoning holds in practice. If it does show an increase in performance, we could apply the same technique in the brokers when data is being read (interning the strings sent by different servers).

That said, as mentioned in my previous message, I think the largest improvement would be to use a Str class that does not allocate on the heap unless it is needed.
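
To make the reasoning about `equals` concrete, here is a minimal sketch in plain Java. It is not the code from the PR: `InternSketch`, `POOL`, `intern` and `decode` are hypothetical names, and the unbounded `ConcurrentHashMap` pool is only an assumption used for illustration. It shows that two strings decoded separately from the same bytes are equal but distinct objects, and that after interning they are the same object, so the identity check that OpenJDK's `String.equals` performs first can succeed without comparing contents.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentHashMap;

public class InternSketch {
    // Hypothetical, unbounded intern pool (illustration only, not Pinot's implementation).
    static final ConcurrentHashMap<String, String> POOL = new ConcurrentHashMap<>();

    // Returns the canonical instance for the given value, registering it if absent.
    static String intern(String value) {
        String existing = POOL.putIfAbsent(value, value);
        return existing != null ? existing : value;
    }

    // Stand-in for reading a dictionary value from a segment: always allocates a new String.
    static String decode(byte[] utf8) {
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "https://example.com/some/long/url".getBytes(StandardCharsets.UTF_8);

        String fromSegmentA = decode(raw);
        String fromSegmentB = decode(raw);

        // Equal content but different objects: equals has to compare the contents.
        System.out.println(fromSegmentA == fromSegmentB);      // false
        System.out.println(fromSegmentA.equals(fromSegmentB)); // true

        String a = intern(fromSegmentA);
        String b = intern(fromSegmentB);

        // Same object after interning: the reference check alone is enough.
        System.out.println(a == b);                            // true
    }
}
```

Whether this translates into a measurable gain for aggregation keys still needs to be benchmarked, since interning adds a hash lookup for every value read.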
