[jira] [Commented] (CARBONDATA-4106) Compaction is not working properly
[ https://issues.apache.org/jira/browse/CARBONDATA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265783#comment-17265783 ] Ajantha Bhat commented on CARBONDATA-4106: --

For your case, each load is mapped to one partition folder (different from the previous loads), and compaction on a partition table can only merge within a partition. So, in your case, it will not combine across partitions, and the table looks the same before and after compaction. Compaction is only useful when a load has multiple partition values and later loads share partition values with earlier loads.

> Compaction is not working properly
> --
>
> Key: CARBONDATA-4106
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4106
> Project: CarbonData
> Issue Type: Improvement
> Components: core
> Affects Versions: 2.0.1
> Environment: Apache Spark 2.4.5, CarbonData 2.0.1
> Reporter: suyash yadav
> Priority: Major
> Fix For: 2.0.1
>
> Attachments: describe_fact_probe_1
>
> Hi Team,
> We are using Apache CarbonData 2.0.1 for one of our POCs, and we observed that we are not getting the expected benefit from compaction (both major and minor).
> Please find below the details of the issue we are facing:
> *Name of the table used*: fact_365_1_probe_1
> *Number of rows:*
> select count(*) from fact_365_1_probe_1
> |count(1)|
> |76963753|
> *Sample data from the table:*
> |ts|metric|tags_id|value|epoch|ts2|
> |2021-01-07 21:05:00|Probe.Duplicate.Poll.Count|c8dead9b-87ae-46ae-8703-bc2b7bfba5d4|39.611356797970274|1610033757768|2021-01-07 00:00:00|
> |2021-01-07 23:50:00|Probe.Duplicate.Poll.Count|62351ef2-f2ce-49d1-a2fd-a0d1e5f6a1b9|72.70658115131307|1610043742516|2021-01-07 00:00:00|
> [^describe_fact_probe_1]
> I have attached the describe output, which shows the other details of the table.
> The size of the table is 3.24 GB, and even after running minor or major compaction the size remains almost the same.
> So we are not getting any benefit from running compaction. Could you please review the shared details and help us identify whether we are missing something here or whether there is a bug?
> We also need answers to the following questions about CarbonData storage:
> 1. For decimal values, how does the storage behave if one row has 20 digits after the decimal point and a second row has only 5? What would be the difference in the storage taken?
> 2. If I have two tables, one with the same value for 100 rows and the other with different values for 100 rows, how will Carbon behave as far as storage is concerned? Which table will take less storage, or will both take the same?
> 3. Also, for the string datatype, could you please describe the storage used for it?

-- This message was sent by Atlassian Jira (v8.3.4#803005)
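For reference, this is how minor and major compaction would be triggered on the reported table (standard CarbonData DDL; as explained in the comment above, segments are still merged only within each partition folder):

```sql
-- Minor compaction: merges small segments, within each partition only
ALTER TABLE fact_365_1_probe_1 COMPACT 'MINOR';

-- Major compaction: merges segments up to the configured size threshold,
-- still only within a single partition folder
ALTER TABLE fact_365_1_probe_1 COMPACT 'MAJOR';
```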
[jira] [Comment Edited] (CARBONDATA-4106) Compaction is not working properly
[ https://issues.apache.org/jira/browse/CARBONDATA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265778#comment-17265778 ] Ajantha Bhat edited comment on CARBONDATA-4106 at 1/15/21, 7:33 AM:

0. Compaction cannot guarantee a reduction in table size; it can only merge small files into bigger files (which can reduce IO time during queries). Reducing the total table size depends on many factors, including data cardinality. Also, your table is already partitioned: compaction will try to merge segments within the same partition, so it will not make much difference for a few segments, and it will not combine files if each load goes into a different partition folder (value) than the previous load.

1. In a column, we group 32000 rows into pages, so the final storage data type depends on all the values in the column page. We try to apply adaptive and delta encoding to these 32000 values, to store them in less space than the actual data type would take.

2. The table where the 100 rows have the same value will be smaller in storage, as we apply RLE encoding and compression.

3. By default, strings undergo dictionary encoding and we store the encoded INT values. If the cardinality in a blocklet exceeds the dictionary threshold, the dictionary cannot be used; in that case we fall back to storing the string itself in byte-array format.
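The adaptive/delta encoding described in point 1 can be illustrated with a short sketch (not CarbonData's actual encoder): for a page of values, storing per-value offsets from the page minimum often fits a much narrower type than the declared column type.

```python
import array

# A "page" of BIGINT epoch-millisecond values (here 32000 consecutive ms,
# similar to the epoch column in the reported table).
base_ts = 1610033757768
page = [base_ts + i for i in range(32000)]

# Delta encoding: store the page minimum once, then per-value offsets.
base = min(page)
deltas = [v - base for v in page]          # max delta here is 31999

# 31999 fits in 2 bytes, so the page can be packed as unsigned shorts
# (2 bytes/value) instead of 8-byte longs.
packed = array.array('H', deltas)          # 'H' = unsigned 16-bit
raw = array.array('q', page)               # 'q' = signed 64-bit

print(raw.itemsize * len(raw), packed.itemsize * len(packed))  # 256000 64000

# Decoding restores the original values exactly.
assert [base + d for d in packed] == page
```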
[jira] [Comment Edited] (CARBONDATA-4106) Compaction is not working properly
[ https://issues.apache.org/jira/browse/CARBONDATA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265778#comment-17265778 ] Ajantha Bhat edited comment on CARBONDATA-4106 at 1/15/21, 7:30 AM:

0. Compaction cannot guarantee a reduction in table size; it can only merge small files into bigger files (which can reduce IO time during queries). Reducing the total table size depends on many factors, including data cardinality. Also, your table is already partitioned: compaction will try to merge segments within the same partition, so it will not make much difference for a few segments.

1. In a column, we group 32000 rows into pages, so the final storage data type depends on all the values in the column page. We try to apply adaptive and delta encoding to these 32000 values, to store them in less space than the actual data type would take.

2. The table where the 100 rows have the same value will be smaller in storage, as we apply RLE encoding and compression.

3. By default, strings undergo dictionary encoding and we store the encoded INT values. If the cardinality in a blocklet exceeds the dictionary threshold, the dictionary cannot be used; in that case we fall back to storing the string itself in byte-array format.
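The dictionary encoding from point 3 can be sketched in plain Python (an illustration of the idea, not CarbonData's on-disk layout): low-cardinality strings are replaced by small integer codes plus a single lookup table.

```python
# Dictionary-encode a low-cardinality string column: each distinct value
# is stored once, and rows hold small integer codes instead of strings.
rows = ["Probe.Duplicate.Poll.Count"] * 50 + ["Probe.Poll.Count"] * 50

dictionary = {}                 # value -> integer code
codes = []
for value in rows:
    code = dictionary.setdefault(value, len(dictionary))
    codes.append(code)

# 100 strings collapse to 2 dictionary entries plus 100 small ints.
print(len(dictionary), codes[:3], codes[-3:])   # 2 [0, 0, 0] [1, 1, 1]

# Decoding: invert the dictionary and look the codes back up.
reverse = {code: value for value, code in dictionary.items()}
assert [reverse[c] for c in codes] == rows
```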
[jira] [Commented] (CARBONDATA-4106) Compaction is not working properly
[ https://issues.apache.org/jira/browse/CARBONDATA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265778#comment-17265778 ] Ajantha Bhat commented on CARBONDATA-4106: --

0. Compaction cannot guarantee a reduction in table size; it can only merge small files into bigger files (which can reduce IO time during queries). Reducing the total table size depends on many factors, including data cardinality. Also, your table is already partitioned: compaction will try to merge segments within the same partition, so it will not make much difference for a few segments.

1. In a column, we group 32000 rows into pages, so the final storage data type depends on all the values in the column page. We try to apply adaptive and delta encoding to these 32000 values, to store them in less space than the actual data type would take.

2. The table where the 100 rows have the same value will be smaller in storage, as we apply RLE encoding and compression.

3. By default, strings undergo dictionary encoding and we store the encoded INT values. If the cardinality in a blocklet exceeds the dictionary threshold, the dictionary cannot be used; in that case we fall back to storing the string itself in byte-array format.
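Point 2 above (a column of identical values compresses far better) can be checked with a quick sketch using run-length encoding plus a general-purpose compressor; this is illustrative only, not CarbonData's actual codec:

```python
import itertools
import zlib

# Two 100-row string columns: one constant, one with all-distinct values.
same = ["metric_a"] * 100
distinct = [f"metric_{i}" for i in range(100)]

def rle(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in itertools.groupby(values)]

# The constant column collapses to a single run; the distinct one cannot.
print(len(rle(same)), len(rle(distinct)))   # 1 100

# After compression the constant column is also far smaller on disk.
packed_same = zlib.compress("\n".join(same).encode())
packed_distinct = zlib.compress("\n".join(distinct).encode())
assert len(packed_same) < len(packed_distinct)
```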
[jira] [Created] (CARBONDATA-4108) How to connect carbondata with Hive
suyash yadav created CARBONDATA-4108:
Summary: How to connect CarbonData with Hive
Key: CARBONDATA-4108
URL: https://issues.apache.org/jira/browse/CARBONDATA-4108
Project: CarbonData
Issue Type: Improvement
Components: core
Affects Versions: 2.0.1
Environment: Apache CarbonData 2.0.1, Spark 2.4.5, Hive 2.0
Reporter: suyash yadav
Fix For: 2.0.1

Hi Team,
We would like to know how to connect Hive with CarbonData. We are doing a POC in which we need to access CarbonData tables through Hive, and the connection must be configured with a username and password. In other words, our Hive connection should use a username and password to connect to the CarbonData tables. Could you please review the above requirement and suggest steps to achieve it?
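One common way to achieve this (an assumption about the intended setup, not an official CarbonData recipe) is to expose the CarbonData tables through a Thrift JDBC/ODBC server and connect from a Hive JDBC client such as beeline, passing credentials with -n/-p; the host, port, user, and password below are placeholders:

```
# Connect with the Hive JDBC client (beeline) to a Thrift server that
# serves the CarbonData tables. Adjust host, port, and credentials for
# your environment; authentication must be enabled on the server side.
beeline -u "jdbc:hive2://thriftserver-host:10000" -n myuser -p mypassword

# Inside beeline, CarbonData tables are queried like any Hive table:
#   select count(*) from fact_365_1_probe_1;
```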
[GitHub] [carbondata] MarvinLitt commented on a change in pull request #4077: [DOC] Running the Thrift JDBC/ODBC server with CarbonExtensions
MarvinLitt commented on a change in pull request #4077: URL: https://github.com/apache/carbondata/pull/4077#discussion_r557878565
## File path: docs/quick-start-guide.md
## @@ -325,9 +325,17 @@ mv carbondata.tar.gz carbonlib/
-## Query Execution Using CarbonData Thrift Server
+## Query Execution Using the Thrift Server
-### Starting CarbonData Thrift Server.
+### Option 1: Starting Thrift Server with CarbonExtensions (since 2.0)
+```
+cd $SPARK_HOME
+./sbin/start-thriftserver.sh \
+--conf spark.sql.extensions=org.apache.spark.sql.CarbonExtensions \
+$SPARK_HOME/carbonlib/apache-carbondata-xxx.jar
+```
Review comment: Users can easily use start-thriftserver.sh. It's great!

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [carbondata] Karan980 edited a comment on pull request #4075: [CARBONDATA-4105]Select * query fails after a SDK written segment is added by alter table add segment query.
Karan980 edited a comment on pull request #4075: URL: https://github.com/apache/carbondata/pull/4075#issuecomment-760659123
> do you have the reproduce steps?
Yes, run the test case I added in this PR without any other changes from the PR ("Test add segment by carbon written by SDK on which read is already performed"). At the end, instead of select count(*), run a select * query.
[GitHub] [carbondata] Karan980 commented on pull request #4075: [CARBONDATA-4105]Select * query fails after a SDK written segment is added by alter table add segment query.
Karan980 commented on pull request #4075: URL: https://github.com/apache/carbondata/pull/4075#issuecomment-760659123
> do you have the reproduce steps?
Yes, run the test case I added in this PR without any other changes from the PR ("Test add segment by carbon written by SDK on which read is already performed").
[GitHub] [carbondata] akashrn5 commented on pull request #4034: [CARBONDATA-4091] support prestosql 333 integartion with carbon
akashrn5 commented on pull request #4034: URL: https://github.com/apache/carbondata/pull/4034#issuecomment-760656532 LGTM. @jackylk, can you review it once?
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4077: [DOC] Running the Thrift JDBC/ODBC server with CarbonExtensions
CarbonDataQA2 commented on pull request #4077: URL: https://github.com/apache/carbondata/pull/4077#issuecomment-760649473 Build Success with Spark 2.3.4, Please check CI http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5312/
[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4077: [DOC] Running the Thrift JDBC/ODBC server with CarbonExtensions
CarbonDataQA2 commented on pull request #4077: URL: https://github.com/apache/carbondata/pull/4077#issuecomment-760648968 Build Success with Spark 2.4.5, Please check CI http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3552/
[GitHub] [carbondata] Indhumathi27 commented on pull request #4076: [WIP][CARBONDATA-4107] Added mvExists property for MV fact table and added lock while touchMDTFile
Indhumathi27 commented on pull request #4076: URL: https://github.com/apache/carbondata/pull/4076#issuecomment-760646765 @QiangCai Even if we list all MV tables in the fact table, we still need the database information in order to fetch the schema files. Also, a table can have many MVs, which would increase the size of the fact table properties.
[GitHub] [carbondata] QiangCai commented on pull request #4075: [CARBONDATA-4105]Select * query fails after a SDK written segment is added by alter table add segment query.
QiangCai commented on pull request #4075: URL: https://github.com/apache/carbondata/pull/4075#issuecomment-760630120 Do you have the steps to reproduce?
[GitHub] [carbondata] QiangCai commented on pull request #4076: [WIP][CARBONDATA-4107] Added mvExists property for MV fact table and added lock while touchMDTFile
QiangCai commented on pull request #4076: URL: https://github.com/apache/carbondata/pull/4076#issuecomment-760627378 How about listing all MV tables in the fact table properties?
[GitHub] [carbondata] QiangCai opened a new pull request #4077: [DOC] Running the Thrift JDBC/ODBC server with CarbonExtensions
QiangCai opened a new pull request #4077: URL: https://github.com/apache/carbondata/pull/4077
### Why is this PR needed?
Since version 2.0, Carbon supports starting the Spark ThriftServer with CarbonExtensions.
### What changes were proposed in this PR?
Add documentation on starting the Spark ThriftServer with CarbonExtensions.
### Does this PR introduce any user interface change?
- No
### Is any new testcase added?
- No