zenoyang opened a new issue #8315: URL: https://github.com/apache/incubator-doris/issues/8315
### Search before asking - [X] I had searched in the [issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and found no similar issues. ### Description For the current low-cardinality string column, when loading data in the storage layer, it is necessary to traverse all the codes in the data_page, convert it into the original string value through the dictionary in the dict_page, and then insert it into the ColumnString column. Subsequent predicate calculation, filtering, etc. are based on ColumnString for processing. If you continue to use encoding during predicate calculation, it is equivalent to converting string comparisons to int type comparisons, resulting in greater performance and less memory consumption. ### Solution Storage layer read call: `SegmentIterator::next_batch-> ... -> PageDecoder::next_batch`, the default batch reads 4096 rows, PageDecoder reads part of data_page, data_page may be bitshuffle encoding or plain encoding. After enabling low cardinality optimization, optimize for string columns with predicate calculations. The resulting column is of type ColumnDictionary and is used to store encodings and dictionaries. If the current batch of data_pages are all bitshuffle encoded, that is, directly insert the encoded value and dictionary into the result column, in the process of PageDecode next_batch, the encoding to String overhead (there is the overhead of constructing a dictionary) is saved. After the column data is loaded in the Segment layer, the predicate calculation is performed based on the encoded value and the dictionary, that is, the predicate constant is converted into an int value through the dictionary, and the int value is compared with the int in the encoded column, which is equivalent to comparing the original String. Converted to int comparison, there will be a significant speedup effect. After the predicate is calculated, the encoding column and dictionary are converted into Colum nString columns to end the segment layer reading. If there is plain encoded data_page in the current batch, the dictionary cannot be used and is converted to a PredicateColumn column for processing. ### Are you willing to submit PR? - [X] Yes I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
