Re: carbon data performance doubts

2017-07-19 Thread Swapnil Shinde
Thank you, Manish. Is dictionary exclude supported for datatypes other than String? https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L706 - Swapnil On Wed, Jul 19, 2017
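For context on what "dictionary exclude" refers to: in CarbonData's DDL (parsed by the CarbonDDLSqlParser linked above), columns are excluded from dictionary encoding via a table property. A minimal sketch, assuming CarbonData 1.x-era syntax; table and column names are illustrative:

```sql
-- Illustrative only: exclude a column from dictionary encoding
-- at table creation time (CarbonData 1.x-era DDL).
CREATE TABLE IF NOT EXISTS sales (
  province STRING,
  age INT
)
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_EXCLUDE'='province');
```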

Re: Presto+CarbonData optimization work discussion

2017-07-19 Thread Liang Chen
Hi Ravi, Thanks for your comment. I tested again with province excluded from the dictionary. In Spark the query time is around 3 seconds; in Presto the same query takes 9 seconds. So for this query case (short string), dictionary lazy decode might not be the key factor. Regards Liang 2017-07-20 10:56 GMT+08:00

Re: Presto+CarbonData optimization work discussion

2017-07-19 Thread Ravindra Pesala
Hi Liang, I see that the province column data is not big, so I guess lazy decoding hardly makes any impact in this scenario. Can you do one more test by excluding province from the dictionary in both the Presto and Spark integrations? That will tell whether it is really a lazy decoding issue or not.

Re: carbon data performance doubts

2017-07-19 Thread manishgupta88
Hi Swapnil, Please find my answers inline. 1. What is the use of the *carbon.number.of.cores* property and how is it different from Spark's executor cores? - carbon.number.of.cores is used for reading the footer and header of the carbondata file during query execution. Spark executor cores is a proper
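The property discussed above is set in CarbonData's carbon.properties configuration file rather than in the Spark config. A minimal sketch; the value shown is illustrative, not a recommendation:

```properties
# carbon.properties (illustrative value)
# Cores used by CarbonData to read carbondata file headers/footers
# during query execution; independent of spark.executor.cores.
carbon.number.of.cores=4
```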

Re: Presto+CarbonData optimization work discussion

2017-07-19 Thread Liang Chen
Hi, For -- 4) Lazy decoding of the dictionary: I just tested 180 million rows of data with the script "select province,sum(age),count(*) from presto_carbondata group by province order by province". The Spark integration module has "dictionary lazy decode"; Presto doesn't have "dictionary lazy decode",
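The idea behind lazy dictionary decode can be sketched in plain Python (this is not CarbonData code; names are illustrative): the group-by runs entirely on compact integer dictionary codes, and the string values are decoded only once per distinct group in the final result, not once per row.

```python
from collections import defaultdict

def group_sum_lazy(codes, values, dictionary):
    """Aggregate dictionary-encoded keys lazily.

    codes: dictionary-encoded group keys (ints)
    values: measure column to sum
    dictionary: code -> original string value
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for code, v in zip(codes, values):
        # Aggregation works on ints; no string decode per row.
        sums[code] += v
        counts[code] += 1
    # Decode only once per distinct group, for the final result set.
    return sorted((dictionary[c], s, counts[c]) for c, s in sums.items())

dictionary = {0: "Beijing", 1: "Shanghai"}
codes = [0, 1, 0, 0, 1]
ages = [20, 30, 25, 40, 35]
print(group_sum_lazy(codes, ages, dictionary))
# [('Beijing', 85, 3), ('Shanghai', 65, 2)]
```

With only a handful of distinct values in a short-string column like province, the savings from deferring the decode are small, which is consistent with the test results reported in this thread.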

Presto+CarbonData optimization work discussion

2017-07-19 Thread Liang Chen
Hi, Below are some proposed items for Presto optimization: 1) Remove the extra loops for data conversion in the Presto format to increase performance. 2) Modularize and optimize the filters. 3) Optimize the CarbonData metadata reading. 4) Lazy decoding of the dictionary. 5) Batch reading of the

carbon data performance doubts

2017-07-19 Thread Swapnil Shinde
Hello All, I am trying CarbonData for the first time and have a few questions on improving performance: 1. What is the use of the *carbon.number.of.cores* property and how is it different from Spark's executor cores? 2. Documentation says, by default, all non-numeric columns (except complex type