Hi Dev

I am currently working on a min/max optimization wherein, for string/varchar
data type columns, we will decide internally whether to write min/max or not.

*Background*
Currently we store min/max for all columns: page-level min/max, blocklet
min/max in the file footer, and all the blocklet metadata entries in the
shard. Consider the case where each column value is more than 10000
characters. If we write min/max in that case, it will be written 3 times for
each column (on the order of 3 x 2 x 10000 = 60000 characters of min/max
metadata per column per blocklet), which will increase the store size and
impact query performance.

*Design proposal*
1. We will introduce a configurable system-level property for the maximum
character count, *"carbon.string.allowed.character.count"*. If the data
crosses this limit, min/max will not be stored for that column (a rough
sketch of this decision appears after this list).
2. If a page does not contain min/max for a column, then the blocklet min/max
will also not contain an entry for that column.
3. The thrift file will be modified to introduce an optional Boolean flag
which will be used during query to identify whether min/max is stored for the
filter column or not.
4. As of now it will be supported only for dimensions of string/varchar
type. We can extend it further to support bigDecimal type measures in the
future if required.
5. The block and blocklet dataMap cache will also store the min/max Boolean
flag for dimension columns, based on which filter pruning will be done. If
min/max is not written for a column, then isScanRequired will return true
during driver pruning.
6. In the executor, blocklet- and page-level min/max will again be checked
for the filter column. If min/max is not written, the complete page data will
be scanned.
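
To make the intent concrete, below is a minimal Java sketch of the proposed
behaviour. All class names, method names and the default value are
hypothetical illustrations, not the actual CarbonData writer/pruning APIs;
only the property name comes from point 1 above. Byte length is used as a
stand-in for character count just for the sketch.

// Hypothetical helper illustrating points 1, 5 and 6 above.
public final class MinMaxWriteDecider {

  // Proposed system-level property; the 512-character default here is an
  // assumption for illustration only.
  private static final String ALLOWED_CHAR_COUNT_PROP =
      "carbon.string.allowed.character.count";
  private static final int DEFAULT_ALLOWED_CHAR_COUNT = 512;

  private final int allowedCharCount;

  public MinMaxWriteDecider(java.util.Properties carbonProperties) {
    this.allowedCharCount = Integer.parseInt(carbonProperties.getProperty(
        ALLOWED_CHAR_COUNT_PROP, String.valueOf(DEFAULT_ALLOWED_CHAR_COUNT)));
  }

  // Write side: skip min/max for a string/varchar page if any value crosses
  // the configured character limit (point 1).
  public boolean shouldWriteMinMax(byte[][] pageValues) {
    for (byte[] value : pageValues) {
      if (value != null && value.length > allowedCharCount) {
        return false;
      }
    }
    return true;
  }

  // Query side: if min/max was not stored for the filter column, the block,
  // blocklet or page cannot be pruned and must be scanned (points 5 and 6).
  public static boolean isScanRequired(boolean minMaxStored, byte[] filterValue,
      byte[] min, byte[] max, java.util.Comparator<byte[]> comparator) {
    if (!minMaxStored) {
      return true; // no min/max available, so we cannot prune
    }
    // Normal pruning: scan only if the filter value lies within [min, max]
    return comparator.compare(filterValue, min) >= 0
        && comparator.compare(filterValue, max) <= 0;
  }
}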

*Backward compatibility*
1. For stores prior to 1.5.0, the min/max flag for all columns will be set to
true while loading the dataMap in the query flow (a small sketch follows).
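
A small sketch of this default, again with hypothetical names (the real flag
would come from the modified thrift footer mentioned in point 3):

// If the footer was written by a pre-1.5.0 store, the optional thrift flag is
// absent, so every column is treated as having min/max and pruning behaves as
// it does today.
public final class LegacyMinMaxFlagDefaults {
  public static boolean[] resolveMinMaxFlags(Boolean[] flagsFromThrift, int columnCount) {
    boolean[] minMaxStored = new boolean[columnCount];
    for (int i = 0; i < columnCount; i++) {
      // null array or null entry => legacy store; default the flag to true
      minMaxStored[i] = flagsFromThrift == null || flagsFromThrift[i] == null
          || flagsFromThrift[i];
    }
    return minMaxStored;
  }
}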

Please feel free to share your inputs and suggestions.

Regards
Manish Gupta
