Re: carbon data performance doubts

Jacky Li Fri, 21 Jul 2017 09:31:53 -0700

Hi Swapnil,

Dictionary encoding benefits aggregation queries (carbon leverages the late 
decode optimization in the SQL optimizer), so you can use it for columns on 
which you frequently do group by. While it can improve query performance, it 
also requires more memory and CPU while loading. Normally, you should consider 
using dictionary only on low cardinality columns.
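
To make the trade-off concrete, here is a toy Python sketch of dictionary 
encoding with a late-decoded group by. It is only an illustration of the idea, 
not carbon's actual implementation; the function names and sample data are 
made up for this example.

```python
from collections import defaultdict

def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary = {}  # value -> code; grows with the column's cardinality
    codes = []
    for v in values:
        # A new value gets the next free code; a known value reuses its code.
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes

def group_sum_late_decode(dictionary, codes, measures):
    """Aggregate on integer codes and decode to real values only at the end."""
    sums = defaultdict(int)
    for code, m in zip(codes, measures):
        sums[code] += m
    decode = {code: value for value, code in dictionary.items()}
    return {decode[code]: total for code, total in sums.items()}

cities = ["shenzhen", "beijing", "shenzhen", "beijing", "shanghai"]
population = [10, 20, 30, 40, 50]

dictionary, codes = dictionary_encode(cities)
result = group_sum_late_decode(dictionary, codes, population)
print(codes)
print(result)
```

The group by only compares small integers, which is where the query speedup 
comes from. The dictionary itself has one entry per distinct value, which is 
why a high cardinality column costs so much memory during loading.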

In the current apache master branch (and all historical releases before 1.2), 
carbon data's default encoding strategy favors query performance over loading 
performance. By default, all string columns are encoded as dictionary. But this 
sometimes creates problems; for example, if there is a high cardinality column 
in the table, loading may fail because the JVM runs out of memory. To avoid 
this, we added the DICTIONARY_EXCLUDE option so that the user can disable this 
default behavior manually. So the DICTIONARY_EXCLUDE property is designed for 
String columns only.

And if you have a low cardinality integer column (like some ID field), you can 
choose to encode it as dictionary by specifying DICTIONARY_INCLUDE, so that 
group by on this integer column will be faster.

All of this is the current behavior. There was a discussion about changing it 
to give more control to the user in the coming release (1.2).
The proposed target behavior is:
1. There will be a default encoding strategy for each data type. If the user 
does not specify any encoding-related property in CREATE TABLE, carbon will use 
the default encoding strategy for each column.
2. There will be an ENCODING property through which the user can override the 
system default strategy. For example, the user can create a table with:

CREATE TABLE t1 (city_name STRING, city_id INT, population INT, area DOUBLE)
TBLPROPERTIES ('ENCODING' = 'city_name: dictionary, city_id: {dictionary, RLE},
population: delta')

This SQL means city_name is encoded using dictionary, city_id is encoded using 
dictionary and then RLE encoding (for the numeric values), population is 
encoded using delta encoding, and area is encoded using the system default 
encoding for the double data type.
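
For readers not familiar with these encodings, here is a rough Python sketch of 
what "dictionary then RLE" and "delta" do to a column. It is a simplified 
illustration under my own assumptions, not the code on the encoding_override 
branch; the helper names and sample values are hypothetical.

```python
def rle_encode(values):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# city_id: dictionary first (value -> small code), then RLE on the codes.
city_id = [755, 755, 755, 10, 10, 21]
dictionary = {}
codes = [dictionary.setdefault(v, len(dictionary)) for v in city_id]
city_id_encoded = rle_encode(codes)

# population: delta encoding stores small differences instead of large values.
population = [1000, 1010, 1015, 1030]
population_encoded = delta_encode(population)
print(city_id_encoded)
print(population_encoded)
```

RLE pays off when equal values sit next to each other (for example in sorted 
data), and delta pays off when neighbouring values are close to each other.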

This change is still in progress (CARBONDATA-1014, 
https://issues.apache.org/jira/browse/CARBONDATA-1014) on the 
apache/encoding_override branch. Once it is done and stable, it will be merged 
into master.

Please advise if you have any suggestions.

Regards,
Jacky

> On Jul 21, 2017, at 12:12 AM, Swapnil Shinde <swapnilushi...@gmail.com> wrote:
> 
> Ok. Just curious - Any reason not to support numeric columns with
> dictionary_exclude? Wouldn't it be useful for numeric unique column which
> should be dimension but avoid creating dictionary  (as it may not be
> beneficial).
> 
> Thanks
> Swapnil
> 
> 
> On Thu, Jul 20, 2017 at 4:20 AM, manishgupta88 <tomanishgupt...@gmail.com>
> wrote:
> 
>> No, Dictionary_Exclude is supported only for String data type columns.
>> 
>> Regards
>> Manish Gupta
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/carbon-data-performance-doubts-tp18438p18559.html
>> Sent from the Apache CarbonData Dev Mailing List archive mailing list
>> archive at Nabble.com.
>>
