gortiz commented on issue #12078: URL: https://github.com/apache/pinot/issues/12078#issuecomment-1882702995
> Why do we need to store strings? We should probably use byte array right and avoid creating string in the first place?

I think that is something we need to explore in the longer term. We would reduce the GC usage by a lot if we do that.

> Just to add some context, the reason why this was added in the first place was the fact that for certain workloads, byte -> String de-serialization was becoming the bottleneck.

Sure, that is something we need to take into account, and we need to be careful with the implementation. This is especially problematic when strings are not [normalized](https://en.wikipedia.org/wiki/Unicode_equivalence).

What I did in the past was to use a `Str` class with two attributes: a `ByteBuffer` and a `String`. When the `Str` is built from IO buffers, the `ByteBuffer` is set to the slice and the `String` is set to null. When materialization is needed (for example, because the IO buffer is about to be released or we need to compare the strings), a `materialize()` method is called. That method initializes the `String`, and from that moment on the `String` is always used. By doing so we can skip the `String` creation (and therefore the heap allocation) in almost all cases where the `Str` is not used as an aggregation key.

> We did try the GC optimizations with -XX:+UseStringDeduplication (and others) but noticed elevated CPU usage affecting our query latencies.

I may be wrong, but dictionaries are bound to the query lifetime, right? I mean, we create the dictionary when the segment is being queried and do not reuse it in subsequent queries. If that is the case, String Deduplication won't be useful at all, because it is only applied to Strings in the old generation.

> I would recommend against using String.intern, see an authoritative source [here](https://shipilev.net/jvm/anatomy-quarks/10-string-intern/), which recommends manual interning over use of String.intern.

I'm with Richard here. My experience with `String.intern` is bad. It is just better to use our own structure to intern Strings.
Something as simple as a Guava Cache is usually better than String.intern.
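
To make the lazy `Str` idea above more concrete, here is a minimal sketch of how such a class could look. It is not code from Pinot or from the project I used it in; the method names `fromBuffer` and `isMaterialized`, the UTF-8 assumption, and the lack of thread safety are all assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical sketch of a lazily materialized string.
 * Holds either a slice of an IO buffer (not yet decoded) or a decoded String.
 * Assumes UTF-8 content and single-threaded access.
 */
final class Str {
    // Slice of the underlying IO buffer; set to null once materialized.
    private ByteBuffer bytes;
    // Decoded value; null until materialize() has been called.
    private String value;

    private Str(ByteBuffer bytes) {
        this.bytes = bytes;
    }

    /** Wraps a region of an IO buffer without copying or decoding it. */
    static Str fromBuffer(ByteBuffer source, int offset, int length) {
        // duplicate() shares the content but has independent position/limit,
        // so the caller's buffer state is left untouched.
        ByteBuffer view = source.duplicate();
        view.limit(offset + length);
        view.position(offset);
        return new Str(view.slice());
    }

    /**
     * Decodes the bytes into a heap String on first use and drops the
     * reference to the IO buffer, so the buffer can be released afterwards.
     */
    String materialize() {
        if (value == null) {
            value = StandardCharsets.UTF_8.decode(bytes).toString();
            bytes = null;
        }
        return value;
    }

    /** True once the String has been created on the heap. */
    boolean isMaterialized() {
        return value != null;
    }

    @Override
    public String toString() {
        return materialize();
    }
}
```

A caller would create `Str` instances while scanning the buffer and call `materialize()` only on the values that have to outlive the buffer or be compared as strings, for example the ones that end up as aggregation keys.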
