Re: [DISCUSS] For the dimension default should be no dictionary

Ravindra Pesala Wed, 01 Mar 2017 04:38:41 -0800

Hi All,

To make no-dictionary columns the default, we should first improve the
storage and performance for these columns. I have sent a separate mail to
discuss the improvement points; please comment on it.
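
To make the store size concern concrete, here is a toy Python sketch. It is
not CarbonData code and not its on-disk format; the function names, the
2-byte length prefix and the fixed-width codes are assumptions made only for
illustration. It compares storing raw strings against storing a dictionary
plus integer codes:

```python
# Toy comparison of raw vs. dictionary-encoded column storage.
# Illustration only: the layout below is assumed, not CarbonData's format.

def raw_size(values):
    """Bytes to store every value as UTF-8 with a 2-byte length prefix."""
    return sum(2 + len(v.encode("utf-8")) for v in values)


def dictionary_encode(values):
    """Map each distinct value to a small integer surrogate key."""
    dictionary = {}
    codes = []
    for v in values:
        # len(dictionary) is evaluated before a new key is inserted,
        # so the first distinct value gets code 0, the next 1, and so on.
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes


def dictionary_size(dictionary, codes):
    """Bytes for the dictionary entries plus fixed-width codes."""
    distinct = len(dictionary)
    # Smallest whole number of bytes that can hold the largest code.
    width = max(1, ((distinct - 1).bit_length() + 7) // 8)
    dict_bytes = sum(2 + len(k.encode("utf-8")) for k in dictionary)
    return dict_bytes + width * len(codes)


low_card = ["china", "india", "usa"] * 1000          # 3 distinct values
high_card = [f"user-{i:06d}" for i in range(3000)]   # every value distinct

for name, column in (("low_card", low_card), ("high_card", high_card)):
    d, c = dictionary_encode(column)
    print(name, "raw:", raw_size(column), "dictionary:", dictionary_size(d, c))
```

In this toy layout the low cardinality column is several times smaller with
a dictionary, while the all-distinct column is larger than its raw form,
because the dictionary has to repeat every value.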


Regards,
Ravindra

On 1 March 2017 at 10:12, Ravindra Pesala <ravi.pes...@gmail.com> wrote:

> Hi Likun,
>
> The same problem applies if we make all columns no-dictionary by default:
> it will increase the store size and reduce performance, and poor
> performance will not encourage more users either.
>
> If we want to make no-dictionary columns the default, then we should first
> focus on reducing the store size and improving filter queries on
> non-dictionary columns. Memory usage is also higher when querying
> non-dictionary columns.
>
> Regards,
> Ravindra.
>
> On 1 March 2017 at 06:00, Jacky Li <jacky.li...@qq.com> wrote:
>
>> Yes, I agree with your point. My only concern is loading: I have seen
>> many users accidentally put a high cardinality column into the dictionary
>> columns, and then the load either failed with out of memory or ran very
>> slowly. I guess they just do not know to use DICTIONARY_EXCLUDE for
>> these columns, or they do not have an easy way to identify the high
>> cardinality columns. I feel preventing such misuse is important in order
>> to encourage more users to use carbondata.
>>
>> Any suggestion on solving this issue?
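
One possible direction, sketched in Python purely as an illustration: scan a
sample of the input before loading and flag columns whose distinct count
passes a threshold as candidates for DICTIONARY_EXCLUDE. The helper name, the
row format (a list of dicts) and the threshold value are all assumptions;
nothing like this is claimed to exist in carbondata.

```python
# Hypothetical pre-load check: flag high cardinality columns in a sample.
# The function name, row format and threshold are illustrative assumptions.

def suggest_dictionary_exclude(rows, columns, max_distinct=1000):
    """Return the columns whose distinct count in `rows` exceeds max_distinct."""
    distinct = {c: set() for c in columns}
    for row in rows:
        for c in columns:
            # Stop collecting once a column is already over the limit,
            # so memory stays bounded by max_distinct + 1 values per column.
            if len(distinct[c]) <= max_distinct:
                distinct[c].add(row[c])
    return [c for c in columns if len(distinct[c]) > max_distinct]


sample = [
    {"country": ["cn", "in", "us"][i % 3], "user_id": f"u{i}"}
    for i in range(5000)
]
print(suggest_dictionary_exclude(sample, ["country", "user_id"]))
```

The result could be shown to the user as a warning, or used to fill in
DICTIONARY_EXCLUDE when the user has not specified anything.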
>>
>>
>> Regards,
>> Likun
>>
>>
>> > On 28 February 2017, at 10:20 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
>> >
>> > Hi Likun,
>> >
>> > You mentioned that if the user does not specify dictionary columns, then
>> > by default those columns are treated as no-dictionary columns.
>> > But keeping no dictionary as the default has many disadvantages, as I
>> > mentioned in the above mail. We initially introduced no-dictionary
>> > columns to handle high cardinality dimensions, but making everything
>> > no-dictionary by default loses our unique feature compared to Parquet.
>> > Dictionary columns were introduced not only for aggregation queries but
>> > also for better compression and better filter queries. Without a
>> > dictionary, the store size will increase a lot.
>> >
>> > Regards,
>> > Ravindra.
>> >
>> > On 28 February 2017 at 18:05, Liang Chen <chenliang6...@gmail.com> wrote:
>> >
>> >> Hi
>> >>
>> >> A couple of questions:
>> >>
>> >> 1) For the SORT_KEY option: are the "MDK index, inverted index, minmax
>> >> index" built only for the columns specified in the option (SORT_KEY)?
>> >>
>> >> 2) If users don't specify TABLE_DICTIONARY, then no column gets
>> >> dictionary encoding, and all shuffle operations are based on the fact
>> >> value. Is my understanding right?
>> >> -------------------------------------------------------------------
>> >> If this option is not specified by the user, all columns are encoded
>> >> without global dictionary support. A normal shuffle on decoded values
>> >> will be applied when doing a group by operation.
>> >>
>> >> 3) After introducing the two options "SORT_KEY and TABLE_DICTIONARY",
>> >> suppose "C2" is specified in SORT_KEY but not in TABLE_DICTIONARY:
>> >> how does the system handle this case?
>> >> -------------------------------------------------------------------
>> >> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
>> >> are encoded as Inverted Index with a Minmax Index
>> >>
>> >> Regards
>> >> Liang
>> >>
>> >> 2017-02-28 19:35 GMT+08:00 Jacky Li <jacky.li...@qq.com>:
>> >>
>> >>> Yes, first we should simplify the DDL options. I propose the following
>> >>> options; please check whether they miss any scenario.
>> >>>
>> >>> 1. SORT_COLUMNS, or SORT_KEY
>> >>> This indicates three things:
>> >>> 1) All columns specified in this option will be used to construct the
>> >>> Multi-Dimensional Key (MDK), and the data will be sorted along this key
>> >>> 2) They will be encoded as Inverted Index and thus sorted again within
>> >>> the column chunk in one blocklet
>> >>> 3) A Minmax index will also be created for these columns
>> >>>
>> >>> When to use: This option is designed for accelerating filter queries,
>> >>> so put all filter columns into this option. The order can be:
>> >>> 1) From low cardinality to high cardinality. This gives the most
>> >>> compression and fits scenarios that do not filter frequently on high
>> >>> cardinality columns
>> >>> 2) High cardinality columns first, then the others. This fits
>> >>> frequent filtering on high cardinality columns
>> >>>
>> >>> For example, SORT_COLUMNS=“C1,C2,C3” means C1, C2, C3 form the MDK and
>> >>> are encoded as Inverted Index with a Minmax Index
>> >>> Note that C1, C2, C3 can be dimensions, but they can also be measures.
>> >>> So if the user needs to filter on a measure column, it can be put in
>> >>> the SORT_COLUMNS option.
>> >>>
>> >>> If this option is not specified by the user, carbon will pick the MDK
>> >>> as it does now.
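
The effect of the sort order and the minmax index on filter queries can be
sketched in Python as follows. This is a simplified model under assumptions:
rows are plain dicts, a blocklet is a fixed-size slice of the sorted rows,
the function names are made up for illustration, and the inverted index
within a column chunk is not modelled at all.

```python
# Simplified model of SORT_COLUMNS: sort rows by a multi-column key, cut
# them into blocklets, and keep (min, max) per sort column per blocklet.
# Names and layout are illustrative assumptions, not CarbonData internals.

def build_blocklets(rows, sort_columns, blocklet_size):
    """Sort rows along the multi-dimensional key and attach minmax per blocklet."""
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in sort_columns))
    blocklets = []
    for start in range(0, len(ordered), blocklet_size):
        chunk = ordered[start:start + blocklet_size]
        minmax = {
            c: (min(r[c] for r in chunk), max(r[c] for r in chunk))
            for c in sort_columns
        }
        blocklets.append({"rows": chunk, "minmax": minmax})
    return blocklets


def filter_equals(blocklets, column, value):
    """Return matching rows and the number of blocklets that had to be read."""
    hits, scanned = [], 0
    for blocklet in blocklets:
        low, high = blocklet["minmax"][column]
        if low <= value <= high:      # minmax cannot rule this blocklet out
            scanned += 1
            hits.extend(r for r in blocklet["rows"] if r[column] == value)
    return hits, scanned


rows = [{"c1": i % 4, "c2": i} for i in range(100)]   # c1 low card, c2 high card
blocklets = build_blocklets(rows, ["c1", "c2"], blocklet_size=25)

print(filter_equals(blocklets, "c1", 2)[1])    # blocklets read for c1 = 2
print(filter_equals(blocklets, "c2", 50)[1])   # blocklets read for c2 = 50
```

With this sample data, a filter on the leading sort column c1 reads one
blocklet, while a filter on c2 has to read all four, which is why the column
order in the option matters.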
>> >>>
>> >>> 2. TABLE_DICTIONARY
>> >>> This specifies the table-level dictionary columns. A global dictionary
>> >>> will be created for all columns in this option on every data load.
>> >>>
>> >>> When to use: This option is designed for accelerating aggregate
>> >>> queries, so put group by columns into this option
>> >>>
>> >>> For example, TABLE_DICTIONARY=“C2,C3,C5”
>> >>>
>> >>> If this option is not specified by the user, all columns are encoded
>> >>> without global dictionary support. A normal shuffle on decoded values
>> >>> will be applied when doing a group by operation.
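
As a rough Python sketch of why a global dictionary helps group by: the
aggregation can run on small integer codes, and the original values are
looked up only once per group at the end. The function names and the sorted
code assignment are assumptions for illustration, not how carbon builds or
uses its dictionary.

```python
# Group-by on dictionary codes with late decoding.
# Illustrative sketch only; names and code assignment are assumptions.
from collections import defaultdict


def build_global_dictionary(values):
    """Assign one integer code per distinct value."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def group_by_sum_encoded(codes, measures, dictionary):
    """Sum measures per code, then decode each group key exactly once."""
    totals = defaultdict(int)
    for code, measure in zip(codes, measures):
        totals[code] += measure       # grouping compares integers only
    reverse = {code: value for value, code in dictionary.items()}
    return {reverse[code]: total for code, total in totals.items()}


cities = ["sz", "bj", "sz", "sh", "bj"]
sales = [1, 2, 3, 4, 5]
dictionary = build_global_dictionary(cities)
codes = [dictionary[c] for c in cities]
print(group_by_sum_encoded(codes, sales, dictionary))
```

Without the dictionary, the same group by has to hash and shuffle the decoded
values themselves, which is the "normal shuffle" case described above.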
>> >>>
>> >>> I think these two options should be the basic options for normal
>> >>> users; the goal is to satisfy most scenarios without deep tuning of
>> >>> the table.
>> >>> For advanced users who want to do deep tuning, we can debate adding
>> >>> more options. But first we need to identify which scenarios are not
>> >>> satisfied by these two options.
>> >>>
>> >>> Regards,
>> >>> Jacky
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>> http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSS-For-the-dimension-default-should-be-no-dictionary-tp8010p8081.html
>> >>> Sent from the Apache CarbonData Mailing List archive mailing list
>> >>> archive at Nabble.com.
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Regards
>> >> Liang
>> >>
>> >
>> >
>> > --
>> > Thanks & Regards,
>> > Ravi
>>
>>
>>
>>
>
>
> --
> Thanks & Regards,
> Ravi
>



-- 
Thanks & Regards,
Ravi
