Hi All,

To make no-dictionary columns the default, we should first improve the storage size and query performance for these columns. I have sent a separate mail to discuss the improvement points. Please comment on it.

Regards,
Ravindra

On 1 March 2017 at 10:12, Ravindra Pesala <ravi.pes...@gmail.com> wrote:

> Hi Likun,
>
> It would be the same case if we use all no-dictionary columns by default: it will increase the store size and decrease the performance, so it also does not encourage more users if performance is poor.
>
> If we need to make no-dictionary columns the default, then we should first focus on reducing the store size and improving filter queries on no-dictionary columns. Even memory usage is higher while querying no-dictionary columns.
>
> Regards,
> Ravindra.
>
> On 1 March 2017 at 06:00, Jacky Li <jacky.li...@qq.com> wrote:
>
>> Yes, I agree with your point. The only concern I have is for loading. I have seen many users accidentally put a high cardinality column into the dictionary columns, and then the loading failed because of out of memory, or the loading was very slow. I guess they just do not know to use DICTIONARY_EXCLUDE for these columns, or they do not have an easy way to identify the high cardinality columns. I feel preventing such misuse is important in order to encourage more users to use carbondata.
>>
>> Any suggestion on solving this issue?
>>
>> Regards,
>> Likun
>>
>> > On 28 February 2017, at 10:20 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
>> >
>> > Hi Likun,
>> >
>> > You mentioned that if the user does not specify dictionary columns, then by default those are chosen as no-dictionary columns. But we have many disadvantages, as I mentioned in the above mail, if you keep no-dictionary as the default. We initially introduced no-dictionary columns to handle high cardinality dimensions, but making everything no-dictionary by default loses our unique feature compared to parquet. Dictionary columns were introduced not only for aggregation queries, but for better compression and better filter queries as well. Without dictionary, the store size will increase a lot.
>> >
>> > Regards,
>> > Ravindra.
>> >
>> > On 28 February 2017 at 18:05, Liang Chen <chenliang6...@gmail.com> wrote:
>> >
>> >> Hi
>> >>
>> >> A couple of questions:
>> >>
>> >> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax index" built only for the columns specified in the option (SORT_KEY)?
>> >>
>> >> 2) If users don't specify TABLE_DICTIONARY, then no columns are dictionary encoded and all shuffle operations are based on the actual values. Is my understanding right?
>> >> ------------------------------------------------------------
>> >> If this option is not specified by user, means all columns encoding without global dictionary support. Normal shuffle on decoded value will be applied when doing group by operation.
>> >>
>> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY", suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY; how will the system handle this case?
>> >> ------------------------------------------------------------
>> >> For example, SORT_COLUMNS=“C1,C2,C3”, means C1,C2,C3 is MDK and encoded as Inverted Index and with Minmax Index
>> >>
>> >> Regards
>> >> Liang
>> >>
>> >> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.li...@qq.com>:
>> >>
>> >>> Yes, first we should simplify the DDL options. I propose the following options; please check whether they miss some scenario.
>> >>>
>> >>> 1. SORT_COLUMNS, or SORT_KEY
>> >>> This indicates three things:
>> >>> 1) All columns specified in the option will be used to construct the Multi-Dimensional Key (MDK), and the data will be sorted along this key
>> >>> 2) They will be encoded as Inverted Index and thus sorted again within the column chunk in one blocklet
>> >>> 3) A Minmax index will also be created for these columns
>> >>>
>> >>> When to use: This option is designed for accelerating filter queries, so put all filter columns into this option. The order can be:
>> >>> 1) From low cardinality to high cardinality. This gives the most compression and fits scenarios that do not have frequent filters on high cardinality columns
>> >>> 2) Put the high cardinality column first, then the others. This fits frequent filters on a high cardinality column
>> >>>
>> >>> For example, SORT_COLUMNS=“C1,C2,C3” means C1,C2,C3 form the MDK and are encoded as Inverted Index with Minmax Index.
>> >>> Note that C1,C2,C3 can be dimensions, but they can also be measures. So if the user needs to filter on a measure column, it can be put in the SORT_COLUMNS option.
>> >>>
>> >>> If this option is not specified by the user, carbon will pick the MDK as it does now.
>> >>>
>> >>> 2. TABLE_DICTIONARY
>> >>> This specifies the table level dictionary columns. A global dictionary will be created for all columns in this option for every data load.
>> >>>
>> >>> When to use: The option is designed for accelerating aggregate queries, so put group by columns into this option.
>> >>>
>> >>> For example, TABLE_DICTIONARY=“C2,C3,C5”
>> >>>
>> >>> If this option is not specified by the user, all columns are encoded without global dictionary support. A normal shuffle on decoded values will be applied when doing a group by operation.
>> >>>
>> >>> I think these two options should be the basic options for normal users; the goal is to satisfy most scenarios without deep tuning of the table.
>> >>> For advanced users who want to do deep tuning, we can debate adding more options. But first we need to identify which scenarios are not satisfied by these two options.
>> >>>
>> >>> Regards,
>> >>> Jacky
>> >>>
>> >>> --
>> >>> View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8081.html
>> >>> Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
>> >>>
>> >>
>> >> --
>> >> Regards
>> >> Liang
>> >>
>> >
>> > --
>> > Thanks & Regards,
>> > Ravi
>>
>
> --
> Thanks & Regards,
> Ravi
>

--
Thanks & Regards,
Ravi
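
For readers following the thread, here is a rough sketch of the concern Likun raised (users have no easy way to spot high cardinality columns) combined with the low-to-high cardinality ordering suggested for SORT_COLUMNS. It is only an illustration in plain Python, not CarbonData code: the function names, the sampling approach, and the `max_dictionary_cardinality` threshold are all assumptions made for this example, and SORT_COLUMNS / TABLE_DICTIONARY are the option names proposed in the thread, not confirmed DDL.

```python
from typing import Dict, List, Sequence


def column_cardinalities(rows: Sequence[Dict[str, object]]) -> Dict[str, int]:
    """Count distinct values per column in a sample of rows."""
    distinct: Dict[str, set] = {}
    for row in rows:
        for name, value in row.items():
            distinct.setdefault(name, set()).add(value)
    return {name: len(values) for name, values in distinct.items()}


def suggest_options(
    rows: Sequence[Dict[str, object]],
    filter_columns: List[str],
    group_by_columns: List[str],
    max_dictionary_cardinality: int = 1000,  # hypothetical threshold
) -> Dict[str, str]:
    """Suggest values for the two options discussed in the thread.

    Every name in filter_columns and group_by_columns must appear in rows.
    """
    card = column_cardinalities(rows)
    # Ordering 1) from the proposal: low cardinality first.
    sort_columns = sorted(filter_columns, key=lambda c: card[c])
    # Keep high cardinality columns out of the dictionary so that loading
    # does not have to build a very large global dictionary.
    dictionary_columns = [
        c for c in group_by_columns if card[c] <= max_dictionary_cardinality
    ]
    return {
        "SORT_COLUMNS": ",".join(sort_columns),
        "TABLE_DICTIONARY": ",".join(dictionary_columns),
    }


if __name__ == "__main__":
    sample = [
        {"country": "IN" if i % 2 else "CN", "user_id": i, "city": f"c{i % 5}"}
        for i in range(100)
    ]
    print(suggest_options(sample, ["user_id", "city", "country"],
                          ["country", "user_id"], max_dictionary_cardinality=10))
```

A distinct count taken from a sample only approximates the true cardinality, so a real check would need either a full pass over the data or an estimator; the sketch is meant only to make the trade-off in the discussion concrete.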