[jira] [Commented] (CARBONDATA-4106) Compaction is not working properly

2021-01-14 Thread Ajantha Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265783#comment-17265783
 ] 

Ajantha Bhat commented on CARBONDATA-4106:
--

For your case, each load is mapped to one partition folder (different from the 
previous loads), and compaction on a partitioned table can only merge data 
within a partition. So, for you, it will not combine data across partitions, 
and the table looks the same before and after compaction. Compaction can only 
be useful if a load has multiple partition values and later loads repeat the 
partition values of previous loads.
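
As a rough illustration (a sketch only; the DDL and partition column below are 
assumptions based on the attached describe output, not your exact schema), here 
is how loads that always carry a new ts2 value leave compaction with nothing to 
merge:

```scala
// Sketch only: assumes a Spark shell with CarbonData's CarbonExtensions
// configured; table/column names mirror the issue but the schema is assumed.
spark.sql("""
  CREATE TABLE IF NOT EXISTS fact_365_1_probe_1 (
    ts TIMESTAMP, metric STRING, tags_id STRING, value DOUBLE, epoch BIGINT
  ) PARTITIONED BY (ts2 TIMESTAMP) STORED AS carbondata
""")

// If every load carries a fresh ts2 value, each segment sits alone in its
// own partition folder, so both of these are effectively no-ops:
spark.sql("ALTER TABLE fact_365_1_probe_1 COMPACT 'MINOR'")
spark.sql("ALTER TABLE fact_365_1_probe_1 COMPACT 'MAJOR'")

// Compaction pays off only when later loads repeat earlier ts2 values,
// leaving several small files inside the same partition folder to merge.
```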

> Compaction is not working properly
> --
>
> Key: CARBONDATA-4106
> URL: https://issues.apache.org/jira/browse/CARBONDATA-4106
> Project: CarbonData
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 2.0.1
> Environment: Apache spark 2.4.5, carbonData 2.0.1
>Reporter: suyash yadav
>Priority: Major
> Fix For: 2.0.1
>
> Attachments: describe_fact_probe_1
>
>
> Hi Team,
> We are using Apache CarbonData 2.0.1 for one of our POCs, and we observed that 
> we are not getting the expected benefit from compaction (both major and 
> minor).
> Please find below the details of the issue we are facing:
> *Name of the table used*: fact_365_1_probe_1
> +*Number of rows:*+
> select count(*) from fact_365_1_probe_1
>  +--------+
>  |count(1)|
>  +--------+
>  |76963753|
>  +--------+
> *Sample data from the table:*
>  | ts| metric| tags_id| value| epoch| ts2|
>  |2021-01-07 21:05:00|Probe.Duplicate.Poll.Count|c8dead9b-87ae-46ae-8703-bc2b7bfba5d4|39.611356797970274|1610033757768|2021-01-07 00:00:00|
>  |2021-01-07 23:50:00|Probe.Duplicate.Poll.Count|62351ef2-f2ce-49d1-a2fd-a0d1e5f6a1b9|72.70658115131307|1610043742516|2021-01-07 00:00:00|
>  
> [^describe_fact_probe_1]
>  
> I have attached the describe output, which shows the other details of the 
> table.
> The size of the table is 3.24 GB, and even after running minor or major 
> compaction the size remains almost the same.
> So we are not getting any benefit from running compaction. Could you please 
> review the shared details and help us identify whether we are missing 
> something here, or whether there is a bug?
> Also, we need answers to the following questions about CarbonData storage:
> 1. For decimal values, how does the storage behave if one row has 20 digits 
> after the decimal point and a second row has only 5 digits? What would be the 
> difference in the storage taken?
> 2. My second question is: if I have two tables, one with the same value in 
> 100 rows and the other with different values in 100 rows, how will Carbon 
> behave as far as storage is concerned? Which table will take less storage, 
> or will both take the same?
> 3. Also, for the string datatype, could you please describe how storage is 
> defined for strings?
>  
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (CARBONDATA-4106) Compaction is not working properly

2021-01-14 Thread Ajantha Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/CARBONDATA-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17265778#comment-17265778
 ] 

Ajantha Bhat edited comment on CARBONDATA-4106 at 1/15/21, 7:33 AM:


0. Compaction cannot guarantee a reduction in table size; it can only merge 
small files into bigger files (this reduces IO time during queries). Reducing 
the total table size depends on many factors, including data cardinality.
Also, your table is already partitioned: compaction will try to merge the 
segments within the same partition, so it will not make much difference for a 
few segments. It also will not combine files if each load goes into a different 
partition folder (value) than the previous load.

1. In a column, we group 32000 rows into pages, so the final storage data type 
depends on all the values in the column page. We try to apply adaptive and 
delta encoding to these 32000 values so that they can be stored in less space 
than the actual data type would require.
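
A toy sketch of the idea (not CarbonData's actual codec): if all the values of 
a LONG page sit close to the page minimum, the deltas fit a narrower type:

```scala
// Toy delta encoding over a column page; CarbonData's real adaptive codecs
// are more involved. This only shows why storage can shrink below the
// declared data type.
val page: Array[Long] = Array(1610033757768L, 1610033757769L, 1610033757775L)
val base = page.min
val deltas = page.map(_ - base) // 0, 1, 7
if (deltas.forall(d => d >= Byte.MinValue && d <= Byte.MaxValue)) {
  val encoded: Array[Byte] = deltas.map(_.toByte)
  println(s"stored base=$base plus ${encoded.length} one-byte deltas " +
    s"instead of ${page.length * 8} bytes of raw longs")
}
```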

2. The table with the same value in all 100 rows will be smaller in storage, 
because we apply RLE encoding and compression.
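
For intuition, a minimal run-length encoder (illustrative only):

```scala
// Toy RLE: 100 identical values collapse to one (value, count) pair,
// while 100 distinct values gain nothing -- hence the size difference
// between the two tables.
def rle[T](values: Seq[T]): Seq[(T, Int)] =
  values.foldLeft(List.empty[(T, Int)]) {
    case ((v, n) :: tail, x) if v == x => (v, n + 1) :: tail
    case (acc, x)                      => (x, 1) :: acc
  }.reverse

println(rle(Seq.fill(100)("same")))             // List((same,100))
println(rle((1 to 100).map(_.toString)).length) // 100 pairs: no gain
```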

3. By default, strings undergo dictionary encoding: we store encoded INT 
values in place of the strings. If the cardinality in a blocklet exceeds the 
dictionary threshold, the dictionary cannot be used, and we fall back to 
storing the string itself in a byte-array format.
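
Again as a toy sketch (the real blocklet-level dictionary and its threshold 
live inside the format):

```scala
// Toy dictionary encoding: low-cardinality strings become INT surrogate
// keys; past the cardinality threshold a real store would fall back to
// raw byte-array values instead.
val column = Seq("probe-a", "probe-b", "probe-a", "probe-a", "probe-b")
val dict: Map[String, Int] = column.distinct.zipWithIndex.toMap
val encoded: Seq[Int] = column.map(dict)   // 0, 1, 0, 0, 1
val reverse = dict.map(_.swap)
assert(encoded.map(reverse) == column)     // lossless round trip
```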





[jira] [Created] (CARBONDATA-4108) How to connect carbondata with Hive

2021-01-14 Thread suyash yadav (Jira)
suyash yadav created CARBONDATA-4108:


 Summary: How to connect carbondata with Hive
 Key: CARBONDATA-4108
 URL: https://issues.apache.org/jira/browse/CARBONDATA-4108
 Project: CarbonData
  Issue Type: Improvement
  Components: core
Affects Versions: 2.0.1
 Environment: apache carbondata 2.0.1, spark 2.4.5, Hive 2.0
Reporter: suyash yadav
 Fix For: 2.0.1


Hi Team,

We would like to know how to connect Hive with CarbonData. We are doing a POC 
in which we need to access CarbonData tables through Hive, and this connection 
must be configured with a username and password. So our Hive connection should 
have username and password settings for connecting to the CarbonData tables.

Could you please review the above requirement and suggest steps to achieve the 
same.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [carbondata] MarvinLitt commented on a change in pull request #4077: [DOC] Running the Thrift JDBC/ODBC server with CarbonExtensions

2021-01-14 Thread GitBox


MarvinLitt commented on a change in pull request #4077:
URL: https://github.com/apache/carbondata/pull/4077#discussion_r557878565



##
File path: docs/quick-start-guide.md
##
@@ -325,9 +325,17 @@ mv carbondata.tar.gz carbonlib/
 
 
 
-## Query Execution Using CarbonData Thrift Server
+## Query Execution Using the Thrift Server
 
-### Starting CarbonData Thrift Server.
+### Option 1: Starting Thrift Server with CarbonExtensions(since 2.0)
+```
+cd $SPARK_HOME
+./sbin/start-thriftserver.sh \
+--conf spark.sql.extensions=org.apache.spark.sql.CarbonExtensions \
+$SPARK_HOME/carbonlib/apache-carbondata-xxx.jar
+```

Review comment:
   Users can easily use start-thriftserver.sh. It's great!
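
   For completeness, a hedged sketch of using the started server: any Hive 
JDBC client can query Carbon tables through it. The URL, port, credentials, 
and table name below are assumptions, not values from the doc:

```scala
// Requires the Hive JDBC driver (org.apache.hive:hive-jdbc) on the classpath.
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:hive2://localhost:10000", "user", "password") // assumed defaults
val rs = conn.createStatement().executeQuery(
  "SELECT count(*) FROM some_carbon_table")           // hypothetical table
while (rs.next()) println(rs.getLong(1))
conn.close()
```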





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] Karan980 edited a comment on pull request #4075: [CARBONDATA-4105] Select * query fails after an SDK written segment is added by alter table add segment query.

2021-01-14 Thread GitBox


Karan980 edited a comment on pull request #4075:
URL: https://github.com/apache/carbondata/pull/4075#issuecomment-760659123


   > do you have the reproduce steps?
   
   Yes, run the test case I added in this PR without any other changes from 
the PR (test: add a segment written by the SDK on which a read has already 
been performed). At the end, in place of the select count(*) query, just run a 
select * query.
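
   Hedged sketch of the shape of that repro (paths and table names are 
placeholders; the authoritative steps are in the PR's test case):

```scala
// Assumes a segment already written by the CarbonData SDK at sdkPath and
// read once, per the test description; syntax follows the documented
// ALTER TABLE ... ADD SEGMENT feature.
val sdkPath = "/tmp/sdk_written_segment" // placeholder path
spark.sql(
  s"ALTER TABLE target_table ADD SEGMENT OPTIONS('path'='$sdkPath', 'format'='carbon')")
spark.sql("SELECT * FROM target_table").show() // fails before this fix
```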



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org








[GitHub] [carbondata] akashrn5 commented on pull request #4034: [CARBONDATA-4091] support prestosql 333 integration with carbon

2021-01-14 Thread GitBox


akashrn5 commented on pull request #4034:
URL: https://github.com/apache/carbondata/pull/4034#issuecomment-760656532


   LGTM. @jackylk, can you review it once?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4077: [DOC] Running the Thrift JDBC/ODBC server with CarbonExtensions

2021-01-14 Thread GitBox


CarbonDataQA2 commented on pull request #4077:
URL: https://github.com/apache/carbondata/pull/4077#issuecomment-760649473


   Build Success with Spark 2.3.4, Please check CI 
http://121.244.95.60:12444/job/ApacheCarbonPRBuilder2.3/5312/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] CarbonDataQA2 commented on pull request #4077: [DOC] Running the Thrift JDBC/ODBC server with CarbonExtensions

2021-01-14 Thread GitBox


CarbonDataQA2 commented on pull request #4077:
URL: https://github.com/apache/carbondata/pull/4077#issuecomment-760648968


   Build Success with Spark 2.4.5, Please check CI 
http://121.244.95.60:12444/job/ApacheCarbon_PR_Builder_2.4.5/3552/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] Indhumathi27 commented on pull request #4076: [WIP][CARBONDATA-4107] Added mvExists property for MV fact table and added lock while touchMDTFile

2021-01-14 Thread GitBox


Indhumathi27 commented on pull request #4076:
URL: https://github.com/apache/carbondata/pull/4076#issuecomment-760646765


   @QiangCai Even if we list all MV tables in the fact table properties, we 
still need the database information in order to fetch the schema files. Also, 
a table can have a large number of MVs, which can increase the size of the 
fact table properties.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai commented on pull request #4075: [CARBONDATA-4105] Select * query fails after an SDK written segment is added by alter table add segment query.

2021-01-14 Thread GitBox


QiangCai commented on pull request #4075:
URL: https://github.com/apache/carbondata/pull/4075#issuecomment-760630120


   do you have the reproduce steps?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai commented on pull request #4076: [WIP][CARBONDATA-4107] Added mvExists property for MV fact table and added lock while touchMDTFile

2021-01-14 Thread GitBox


QiangCai commented on pull request #4076:
URL: https://github.com/apache/carbondata/pull/4076#issuecomment-760627378


   How about listing all MV tables in the fact table properties?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [carbondata] QiangCai opened a new pull request #4077: [DOC] Running the Thrift JDBC/ODBC server with CarbonExtensions

2021-01-14 Thread GitBox


QiangCai opened a new pull request #4077:
URL: https://github.com/apache/carbondata/pull/4077


### Why is this PR needed?
Since version 2.0, Carbon supports starting the Spark ThriftServer with 
CarbonExtensions.

### What changes were proposed in this PR?
Add documentation on starting the Spark ThriftServer with CarbonExtensions.
   
### Does this PR introduce any user interface change?
- No
   
### Is any new testcase added?
- No
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org