zenoyang opened a new issue #8315:
URL: https://github.com/apache/incubator-doris/issues/8315


   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/incubator-doris/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Description
   
   For the current low-cardinality string column, when loading data in the 
storage layer, it is necessary to traverse all the codes in the data_page, 
convert it into the original string value through the dictionary in the 
dict_page, and then insert it into the ColumnString column. Subsequent 
predicate calculation, filtering, etc. are based on ColumnString for processing.
   
   If you continue to use encoding during predicate calculation, it is 
equivalent to converting string comparisons to int type comparisons, resulting 
in greater performance and less memory consumption.
   
   
   ### Solution
   
   Storage layer read call: `SegmentIterator::next_batch-> ... -> 
PageDecoder::next_batch`, the default batch reads 4096 rows, PageDecoder reads 
part of data_page, data_page may be bitshuffle encoding or plain encoding.
   
   After enabling low cardinality optimization, optimize for string columns 
with predicate calculations. The resulting column is of type ColumnDictionary 
and is used to store encodings and dictionaries. If the current batch of 
data_pages are all bitshuffle encoded, that is, directly insert the encoded 
value and dictionary into the result column, in the process of PageDecode 
next_batch, the encoding to String overhead (there is the overhead of 
constructing a dictionary) is saved. After the column data is loaded in the 
Segment layer, the predicate calculation is performed based on the encoded 
value and the dictionary, that is, the predicate constant is converted into an 
int value through the dictionary, and the int value is compared with the int in 
the encoded column, which is equivalent to comparing the original String. 
Converted to int comparison, there will be a significant speedup effect. After 
the predicate is calculated, the encoding column and dictionary are converted 
into Colum
 nString columns to end the segment layer reading.
   
   If there is plain encoded data_page in the current batch, the dictionary 
cannot be used and is converted to a PredicateColumn column for processing.
   
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to