Hi 

Thanks for you shared these experience.
Can you put these FAQ to CWIKI:
https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home

Regards
Liang



bill.zhou wrote
> Discussion how to crate the CarbonData table with good performance

> Suggestion to create Carbon table

> Recently we used CarbonData to do the performance in Telecommunication
> filed and summarize some of the Suggestions while creating the CarbonData
> table.
> </br>
> We have tables which range from 10 thousand rows to 10 billion rows and
> have from 100 columns to 300 columns. Following are some of the columns
> used in the table.

> 

> 

> Column name

> 

> Data type

> 

> Cardinality

> 

> Attribution

> 

> 

> 

> msisdn

> 

> String

> 

> 30 million

> 

> dimension

> 

> 

> 

> BEGIN_TIME

> 

> bigint

> 

> 10 thousand

> 

> dimension

> 

> 

> 

> HOST

> 

> String

> 

> 1 million

> 

> dimension

> 

> 

> 

> Dime_1

> 

> String

> 

> 1 thousand

> 

> dimension

> 

> 

> 

> counter_1

> 

> numeric(20,0)

> 

> NA

> 

> measure

> 

> 

> 

> ...

> 

> ...

> 

> NA

> 

> ...

> 

> 

> 

> 

> counter_100

> 

> numeric(20,0)

> 

> NA

> 

> measure

> 

> 

> 

> We have about more than 50 test cases; according to the test case we
> summarize some suggestion to create the table which can have a better
> query performance. 
> </br>
> 1.    Put the frequently-used column filter in the beginning. 
> </br>
> For example, MSISDN filter is used in most of the query then put the
> MSISDN in the first column. The create table command can be as follows,
> the query which has MSISDN as a filter will be good (because the MSISDN is
> high cardinality, if create table like this the compress ratio will be
> decreased)
> </br>
/
> create table carbondata_table(
/
> </br>
/
> msisdn String,
/
> </br>
/
> ...
/
> </br>
/
> )STORED BY 'org.apache.carbondata.format' 
/
> </br>
/
> TBLPROPERTIES (
> 'DICTIONARY_EXCLUDE'='MSISDN,..','DICTIONARY_INCLUDE'='...');
/
> </br>
> </br>

> 2.    If has multiple column which is frequently-use in the filter, put it to
> the front in the order as low cardinality to high cardinality.
> </br>
> For example if msisdn, host and dime_1 is frequently-used column, the
> table column order can be like dime_1->host->msisdn, because the dime_1
> cardinality is low. Create table command can be as follows. This will
> increase the compression ratio and good performance for filter on dime_1,
> host and msisdn.
> </br>
>  
/
> create table carbondata_table(
> </br>
> Dime_1 String,
> </br>
> HOST String,
> </br>
> MSISDN String,
> </br>
> ...
> </br>
> )STORED BY 'org.apache.carbondata.format' 
> </br>
/
> TBLPROPERTIES (
> 'DICTIONARY_EXCLUDE'='MSISDN,HOST..','DICTIONARY_INCLUDE'='Dime_1..');
> </br>
> </br>

> 3.    If no column is frequent-use in filter, then can put all the dimension
> column order as from low cardinality to high cardinality. Create table
> command can be as following: 
> </br>
/
> create table carbondata_table(
> </br>
> Dime_1 String,
> </br>
> BEGIN_TIME bigint
> </br>
> HOST String,
> </br>
> MSISDN String,
> </br>
> ...
> </br>
> )STORED BY 'org.apache.carbondata.format' 
> </br>
> TBLPROPERTIES (
> 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI..','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME..');
> </br>
/
> </br>

> 4.    For measure that needs no high accuracy, then no need to use
> numeric(20,0) data type, suggestion is to use double to replace it than
> will increase the query performance. If one test case uses double to
> replace the numeric (20, 0) the query improve 5 times from 15 second to 3
> second. Create table command can be as follows. 
> </br>
/
> create table carbondata_table(
> </br>
> Dime_1 String,
> </br>
> BEGIN_TIME bigint
> </br>
> HOST String,
> </br>
> MSISDN String,
> </br>
> counter_1 double,
> </br>
> counter_2 double,
> </br>
> ...
> </br>
> counter_100 double,
> </br>
> )STORED BY 'org.apache.carbondata.format' 
> </br>
> TBLPROPERTIES (
> 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
> </br>
/
> </br>

> 5.    If the column which is always incremental like start_time. For example
> one scenario: every day we will load data into carbon and the start_time
> is incremental for each load. For this scenario you can put the start_time
> column in the back of dimension, because always incremental value can use
> the min/max index well always. Create table command can be as following. 
> </br>
/
> create table carbondata_table(
> </br>
> Dime_1 String,
> </br>
> HOST String,
> </br>
> MSISDN String,
> </br>
> counter_1 double,
> </br>
> counter_2 double,
> </br>
> BEGIN_TIME bigint,
> </br>
> ...
> </br>
> counter_100 double,
> </br>
> )STORED BY 'org.apache.carbondata.format' 
> </br>
> TBLPROPERTIES (
> 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
> </br>
/
> 
> </br>

> One more is for the dimension whether dictionary is needed or not, we
> suggest if the cardinality higher than 50 thousand do not put it as
> dictionary column. If high cardinality column put as dictionary will
> impact the load performance.





--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-how-to-crate-the-CarbonData-table-with-good-performance-tp1977p1982.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.

Reply via email to