[GitHub] carbondata pull request #1534: [CARBONDATA-1770] Update error docs and conso...
Github user asfgit closed the pull request at: https://github.com/apache/carbondata/pull/1534 ---
Github user chenliang613 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152589826

--- Diff: docs/data-management-on-carbondata.md ---
@@ -0,0 +1,713 @@
+
+
+# Data Management on CarbonData
+
+This tutorial is going to introduce all commands and data operations on CarbonData.
+
+* [CREATE TABLE](#create-table)
+* [TABLE MANAGEMENT](#table-management)
+* [LOAD DATA](#load-data)
+* [UPDATE AND DELETE](#update-and-delete)
+* [COMPACTION](#compaction)
+* [PARTITION](#partition)
+* [BUCKETING](#bucketing)
+* [SEGMENT MANAGEMENT](#segment-management)
+
+## CREATE TABLE
+
+  This command can be used to create a CarbonData table by specifying the list of fields along with the table properties.
+
+  ```
+  CREATE TABLE [IF NOT EXISTS] [db_name.]table_name[(col_name data_type , ...)]
+  STORED BY 'carbondata'
+  [TBLPROPERTIES (property_name=property_value, ...)]
+  ```
+
+### Usage Guidelines
+
+  Following are the guidelines for TBLPROPERTIES, CarbonData's additional table options can be set via carbon.properties.
+
+  - **Dictionary Encoding Configuration**
+
+    Dictionary encoding is turned off for all columns by default from 1.3 onwards, you can use this command for including columns to do dictionary encoding.
+    Suggested use cases : do dictionary encoding for low cardinality columns, it might help to improve data compression ratio and performance.
+
+    ```
+    TBLPROPERTIES ('DICTIONARY_INCLUDE'='column1, column2')
+    ```
+
+  - **Inverted Index Configuration**
+
+    By default inverted index is enabled, it might help to improve compression ratio and query speed, especially for low cardinality columns which are in reward position.
+    Suggested use cases : For high cardinality columns, you can disable the inverted index for improving the data loading performance.
+
+    ```
+    TBLPROPERTIES ('NO_INVERTED_INDEX'='column1, column3')
+    ```
+
+  - **Sort Columns Configuration**
+
+    This property is for users to specify which columns belong to the MDK(Multi-Dimensions-Key) index.
+    * If users don't specify "SORT_COLUMN" property, by default MDK index be built by using all dimension columns except complex datatype column.
+    * If this property is specified but with empty argument, then the table will be loaded without sort..
+    Suggested use cases : Only build MDK index for required columns,it might help to improve the data loading performance.
+
+    ```
+    TBLPROPERTIES ('SORT_COLUMNS'='column1, column3')
+    OR
+    TBLPROPERTIES ('SORT_COLUMNS'='')
+    ```
+
+  - **Sort Scope Configuration**
+
+    This property is for users to specify the scope of the sort during data load, following are the types of sort scope.
+
+    * LOCAL_SORT: It is the default sort scope.
+    * NO_SORT: It will load the data in unsorted manner, it will significantly increase load performance.
+    * BATCH_SORT: It increases the load performance but decreases the query performance if identified blocks > parallelism.
+    * GLOBAL_SORT: It increases the query performance, especially high concurrent point query.
+      And if you care about loading resources isolation strictly, because the system uses the spark GroupBy to sort data, the resource can be controlled by spark.
+
+  - **Table Block Size Configuration**
+
+    This command is for setting block size of this table, the default value is 1024 MB and supports a range of 1 MB to 2048 MB.
+
+    ```
+    TBLPROPERTIES ('TABLE_BLOCKSIZE'='512')
+    //512 or 512M both are accepted.

--- End diff --

accept, fixed. ---
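Taken together, the TBLPROPERTIES discussed in the quoted doc can be combined in a single statement. A minimal sketch, assuming the syntax shown in the quoted doc; the database, table, and column names are hypothetical:

```
CREATE TABLE IF NOT EXISTS sales_db.web_clicks (
  user_region String,
  session_id String,
  click_time Timestamp,
  click_count Int)
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_INCLUDE'='user_region',
  'NO_INVERTED_INDEX'='session_id',
  'SORT_COLUMNS'='user_region,click_time',
  'SORT_SCOPE'='LOCAL_SORT',
  'TABLE_BLOCKSIZE'='512')
```

Here `user_region` is assumed to be low cardinality (so dictionary encoding and the inverted index pay off), while `session_id` is assumed to be high cardinality (so its inverted index is disabled to speed up loading), matching the suggested use cases above.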
Github user chenliang613 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152589597

--- Diff: docs/data-management-on-carbondata.md ---
[quoted context identical to the excerpt above, continuing:]
+### Example:
+```
+CREATE TABLE IF NOT EXISTS productSchema.productSalesTable (
+  productNumber Int,
+  productName String,
+  storeCity String,
+  storeProvince String,
+  productCategory String,
+  productBatch String,
+  saleQuantity Int,
+  revenue Int)
+STORED BY 'carbondata'
+TBLPROPERTIES ('DICTIONARY_INCLUDE'='productNumber',
+  'NO_INVERTED_INDEX'='productBatch',
+  'SORT_COLUMNS'='productName,storeCity',
+  'SORT_SCOPE'='NO_SORT',
+  'TABLE_BLOCKSIZE'='512')
+```
+
+## TABLE MANAGEMENT
+
+### SHOW TABLE
+
+  This command can be
Github user vandana7 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152549583

--- Diff: docs/data-management-on-carbondata.md ---
[quoted diff identical to the excerpt above; message truncated in the archive] ---
Github user vandana7 commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152548815

--- Diff: docs/data-management-on-carbondata.md ---
[quoted context identical to the excerpt above, ending at:]
+    ```
+    TBLPROPERTIES ('TABLE_BLOCKSIZE'='512')
+    //512 or 512M both are accepted.

--- End diff --

add a Note tag before writing 512 or 512M both are accepted. as "//" are used in the code for making notes or comments ---
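Applied to the doc, the reviewer's suggestion might render the snippet as follows. This is only an illustrative sketch of the proposed change, not the merged text:

```
TBLPROPERTIES ('TABLE_BLOCKSIZE'='512')
```
NOTE: 512 or 512M both are accepted.

This moves the remark out of the code block, so it no longer reads as a `//` code comment.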
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480438

--- Diff: docs/data-management-on-carbondata.md ---
@@ -461,25 +461,46 @@ This tutorial is going to introduce all commands and data operations on CarbonDa
 ## COMPACTION

-This command merges the specified number of segments into one segment, compaction help to improve query performance.
-```
+  Compaction help to improve query performance, because frequently load data, will generate several CarbonData files, because data is sorted only within each load(per load per segment and one B+ tree index).
+  This means that there will be one index for each load and as number of data load increases, the number of indices also increases.
+  Compaction feature combines several segments into one large segment by merge sorting the data from across the segments.
+
+  There are two types of compaction Minor and Major compaction.
+
+  ```
 ALTER TABLE [db_name.]table_name COMPACT 'MINOR/MAJOR'
-```
+  ```

 - **Minor Compaction**
+
+  In minor compaction the user can specify how many loads to be merged.
+  Minor compaction triggers for every data load if the parameter carbon.enable.auto.load.merge is set to true.
+  If any segments are available to be merged, then compaction will run parallel with data load, there are 2 levels in minor compaction:
+  * Level 1: Merging of the segments which are not yet compacted.
+  * Level 2: Merging of the compacted segments again to form a bigger segment.
+  ```
 ALTER TABLE table_name COMPACT 'MINOR'
 ```

 - **Major Compaction**
+
+  In Major compaction, many segments can be merged into one big segment.

--- End diff --

In Major compaction, multiple segments can be merged into one large segment. ---
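The compaction commands quoted above can be exercised as a short sketch. The table name is hypothetical, and per the quoted doc the auto-merge switch is assumed to live in carbon.properties rather than in SQL:

```
-- In carbon.properties (assumption, per the quoted doc):
--   carbon.enable.auto.load.merge=true
-- With this set, minor compaction also triggers automatically after each data load.

-- Trigger minor compaction manually; merges the not-yet-compacted
-- segments (Level 1), then re-merges compacted ones (Level 2):
ALTER TABLE sales_db.web_clicks COMPACT 'MINOR'

-- Trigger major compaction, typically during off-peak hours:
ALTER TABLE sales_db.web_clicks COMPACT 'MAJOR'
```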
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480127

--- Diff: docs/data-management-on-carbondata.md ---
[quoted context identical to the excerpt above, ending at:]
+  There are two types of compaction Minor and Major compaction.

--- End diff --

There are two types of compaction, Minor and Major compaction. ---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480541

--- Diff: docs/data-management-on-carbondata.md ---
[quoted context identical to the excerpt above, continuing:]
+  In Major compaction, many segments can be merged into one big segment.
+  User will specify the compaction size until which segments can be merged, Major compaction is usually done during the off-peak time.
+  This command merges the specified number of segments into one segment:
+  ```
 ALTER TABLE table_name COMPACT 'MAJOR'
 ```

 ## PARTITION
+  Similar other system's partition features, CarbonData's partition feature can be used to improve query performance by filtering on the partition column.

--- End diff --

Similar to other system's partition features, CarbonData's partition feature also can be used to improve query performance by filtering on the partition column. ---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480386

--- Diff: docs/data-management-on-carbondata.md ---
[quoted context identical to the excerpt above, ending at:]
+  * Level 2: Merging of the compacted segments again to form a bigger segment.

--- End diff --

Level 2: Merging of the compacted segments again to form a larger segment. ---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480183

--- Diff: docs/data-management-on-carbondata.md ---
[quoted context identical to the excerpt above, ending at:]
+  In minor compaction the user can specify how many loads to be merged.

--- End diff --

In Minor compaction, user can specify the number of loads to be merged. ---
Github user sgururajshetty commented on a diff in the pull request:
https://github.com/apache/carbondata/pull/1534#discussion_r152480015

--- Diff: docs/data-management-on-carbondata.md ---
[quoted context identical to the excerpt above, ending at:]
+  Compaction help to improve query performance, because frequently load data, will generate several CarbonData files, because data is sorted only within each load(per load per segment and one B+ tree index).

--- End diff --

Compaction improves the query performance significantly. During the load data, several CarbonData files are generated, this is because data is sorted only within each load (per load segment and one B+ tree index). ---