+1. This is an essential feature for large strings. We should not store min/max for large text columns, as it increases the store size.
Regards,
Ravindra

On Sat, 15 Sep 2018 at 12:14 PM, manish gupta <tomanishgupt...@gmail.com> wrote:

> Hi Dev,
>
> I am currently working on a min/max optimization in which, for
> string/varchar columns, we will decide internally whether or not to
> write min/max.
>
> *Background*
> Currently we store min/max for all columns. We store page min/max and
> blocklet min/max in the file footer, and all the blocklet metadata
> entries in the shard. Consider the case where each column value is
> more than 10000 characters long. If we write min/max in this case, it
> will be written 3 times for each column, which increases the store
> size and in turn impacts query performance.
>
> *Design proposal*
> 1. We will introduce a configurable system-level property for the
> maximum number of characters,
> *"carbon.string.allowed.character.count"*. If the data crosses this
> limit, min/max will not be stored for that column.
> 2. If a page does not contain min/max for a column, then the blocklet
> min/max will also not contain an entry for that column.
> 3. The Thrift file will be modified to introduce an optional Boolean
> flag, which will be used during query to identify whether min/max is
> stored for the filter column or not.
> 4. As of now this will be supported only for dimensions of
> string/varchar type. We can extend it to support bigDecimal type
> measures in future if required.
> 5. The block and blocklet dataMap cache will also store the min/max
> Boolean flag for dimension columns, based on which filter pruning will
> be done. If min/max is not written for a column, then isScanRequired
> will return true in driver pruning.
> 6. In the executor, page-level and blocklet-level min/max will be
> checked again for the filter column. If min/max is not written, the
> complete page data will be scanned.
>
> *Backward compatibility*
> 1. For stores prior to 1.5.0, the min/max flag for all columns will be
> set to true while loading the dataMap in the query flow.
>
> Please feel free to share your inputs and suggestions.
>
> Regards
> Manish Gupta
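
To make the proposal above concrete, here is a minimal Java sketch of points 1, 2 and 5. It is an illustration only, not CarbonData code: the class `MinMaxSketch`, the type `PageMinMax` and the methods `buildPageMinMax` and `mergeToBlocklet` are hypothetical names, the character limit is passed in as a plain int rather than read from "carbon.string.allowed.character.count", and values are compared as Java strings with an equality filter assumed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the proposed behaviour; none of these names are CarbonData APIs.
final class MinMaxSketch {

    /** Min/max of one column for a page or blocklet; written == false means nothing was stored. */
    static final class PageMinMax {
        final boolean written;
        final String min;
        final String max;

        PageMinMax(boolean written, String min, String max) {
            this.written = written;
            this.min = min;
            this.max = max;
        }
    }

    /** Point 1: skip min/max as soon as any value exceeds the allowed character count. */
    static PageMinMax buildPageMinMax(List<String> values, int allowedCharacterCount) {
        String min = null;
        String max = null;
        for (String v : values) {
            if (v.length() > allowedCharacterCount) {
                return new PageMinMax(false, null, null);
            }
            if (min == null || v.compareTo(min) < 0) min = v;
            if (max == null || v.compareTo(max) > 0) max = v;
        }
        // An empty page has no min/max either, so it is treated as "not written".
        return new PageMinMax(min != null, min, max);
    }

    /** Point 2: if any page lacks min/max, the blocklet entry is not written either. */
    static PageMinMax mergeToBlocklet(List<PageMinMax> pages) {
        String min = null;
        String max = null;
        for (PageMinMax p : pages) {
            if (!p.written) {
                return new PageMinMax(false, null, null);
            }
            if (min == null || p.min.compareTo(min) < 0) min = p.min;
            if (max == null || p.max.compareTo(max) > 0) max = p.max;
        }
        return new PageMinMax(min != null, min, max);
    }

    /** Point 5: without min/max we cannot prune, so the scan is always required. */
    static boolean isScanRequired(PageMinMax mm, String filterValue) {
        if (!mm.written) {
            return true;
        }
        return filterValue.compareTo(mm.min) >= 0 && filterValue.compareTo(mm.max) <= 0;
    }

    public static void main(String[] args) {
        PageMinMax shortPage = buildPageMinMax(Arrays.asList("b", "a", "c"), 5);
        PageMinMax longPage = buildPageMinMax(Arrays.asList("a", "toolongvalue"), 5);
        System.out.println(isScanRequired(shortPage, "z"));
        System.out.println(isScanRequired(longPage, "z"));
    }
}
```

In this sketch the "not written" state plays the role of the Boolean flag from point 3: a reader that sees the flag unset falls back to scanning instead of trusting a missing min/max.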