Hi Thanks for you shared these experience. Can you put these FAQ to CWIKI: https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home
Regards Liang bill.zhou wrote > Discussion how to crate the CarbonData table with good performance > Suggestion to create Carbon table > Recently we used CarbonData to do the performance in Telecommunication > filed and summarize some of the Suggestions while creating the CarbonData > table. > </br> > We have tables which range from 10 thousand rows to 10 billion rows and > have from 100 columns to 300 columns. Following are some of the columns > used in the table. > > > Column name > > Data type > > Cardinality > > Attribution > > > > msisdn > > String > > 30 million > > dimension > > > > BEGIN_TIME > > bigint > > 10 thousand > > dimension > > > > HOST > > String > > 1 million > > dimension > > > > Dime_1 > > String > > 1 thousand > > dimension > > > > counter_1 > > numeric(20,0) > > NA > > measure > > > > ... > > ... > > NA > > ... > > > > > counter_100 > > numeric(20,0) > > NA > > measure > > > > We have about more than 50 test cases; according to the test case we > summarize some suggestion to create the table which can have a better > query performance. > </br> > 1. Put the frequently-used column filter in the beginning. > </br> > For example, MSISDN filter is used in most of the query then put the > MSISDN in the first column. The create table command can be as follows, > the query which has MSISDN as a filter will be good (because the MSISDN is > high cardinality, if create table like this the compress ratio will be > decreased) > </br> / > create table carbondata_table( / > </br> / > msisdn String, / > </br> / > ... / > </br> / > )STORED BY 'org.apache.carbondata.format' / > </br> / > TBLPROPERTIES ( > 'DICTIONARY_EXCLUDE'='MSISDN,..','DICTIONARY_INCLUDE'='...'); / > </br> > </br> > 2. If has multiple column which is frequently-use in the filter, put it to > the front in the order as low cardinality to high cardinality. > </br> > For example if msisdn, host and dime_1 is frequently-used column, the > table column order can be like dime_1->host->msisdn, because the dime_1 > cardinality is low. Create table command can be as follows. This will > increase the compression ratio and good performance for filter on dime_1, > host and msisdn. > </br> > / > create table carbondata_table( > </br> > Dime_1 String, > </br> > HOST String, > </br> > MSISDN String, > </br> > ... > </br> > )STORED BY 'org.apache.carbondata.format' > </br> / > TBLPROPERTIES ( > 'DICTIONARY_EXCLUDE'='MSISDN,HOST..','DICTIONARY_INCLUDE'='Dime_1..'); > </br> > </br> > 3. If no column is frequent-use in filter, then can put all the dimension > column order as from low cardinality to high cardinality. Create table > command can be as following: > </br> / > create table carbondata_table( > </br> > Dime_1 String, > </br> > BEGIN_TIME bigint > </br> > HOST String, > </br> > MSISDN String, > </br> > ... > </br> > )STORED BY 'org.apache.carbondata.format' > </br> > TBLPROPERTIES ( > 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI..','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME..'); > </br> / > </br> > 4. For measure that needs no high accuracy, then no need to use > numeric(20,0) data type, suggestion is to use double to replace it than > will increase the query performance. If one test case uses double to > replace the numeric (20, 0) the query improve 5 times from 15 second to 3 > second. Create table command can be as follows. > </br> / > create table carbondata_table( > </br> > Dime_1 String, > </br> > BEGIN_TIME bigint > </br> > HOST String, > </br> > MSISDN String, > </br> > counter_1 double, > </br> > counter_2 double, > </br> > ... > </br> > counter_100 double, > </br> > )STORED BY 'org.apache.carbondata.format' > </br> > TBLPROPERTIES ( > 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME'); > </br> / > </br> > 5. If the column which is always incremental like start_time. For example > one scenario: every day we will load data into carbon and the start_time > is incremental for each load. For this scenario you can put the start_time > column in the back of dimension, because always incremental value can use > the min/max index well always. Create table command can be as following. > </br> / > create table carbondata_table( > </br> > Dime_1 String, > </br> > HOST String, > </br> > MSISDN String, > </br> > counter_1 double, > </br> > counter_2 double, > </br> > BEGIN_TIME bigint, > </br> > ... > </br> > counter_100 double, > </br> > )STORED BY 'org.apache.carbondata.format' > </br> > TBLPROPERTIES ( > 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME'); > </br> / > > </br> > One more is for the dimension whether dictionary is needed or not, we > suggest if the cardinality higher than 50 thousand do not put it as > dictionary column. If high cardinality column put as dictionary will > impact the load performance. -- View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/Discussion-how-to-crate-the-CarbonData-table-with-good-performance-tp1977p1982.html Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.