carbon data performance doubts

Swapnil Shinde Wed, 19 Jul 2017 00:24:39 -0700

Hello All,
     I am trying CarbonData for the first time and have a few questions on
improving performance:


1. What is the use of the *carbon.number.of.cores* property, and how is it
different from Spark's executor cores?

2. The documentation says that, by default, all non-numeric columns (except
complex types) become dimensions and numeric columns become measures. How
are dimension and measure columns handled differently? What are the pros
and cons of keeping a column as a dimension vs. a measure?

3. What is the best approach when we have an ID INT column that will be
used heavily for filtering/aggregation/joins but can't be a dimension by
default? The documentation says to include this kind of numeric column in
"dictionary_include" or "dictionary_exclude" in the table definition so
that the column will be treated as a dimension. However, non-string data
types are not supported in "dictionary_exclude" (link
<https://github.com/apache/carbondata/blob/6488bc018a2ec715b31407d12290680d388a43b3/integration/spark-common/src/main/scala/org/apache/spark/sql/catalyst/CarbonDDLSqlParser.scala#L690>).
Then do we have to enable dictionary encoding for ID INT columns, and is it
beneficial to encode them?

4. How does the MDK get generated, and how can we alter it? Is there any
API to find out the MDK for a given table?
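
To make question 3 concrete, here is a minimal sketch of the kind of table
definition I mean. The table name, column names, and the helper function are
made up for illustration; I am assuming the property is spelled
DICTIONARY_INCLUDE and passed through TBLPROPERTIES with the
STORED BY 'carbondata' syntax, as described in the DDL documentation. The
script only builds and prints the statement, it does not need Spark.

```python
# Hypothetical helper: assembles a CarbonData CREATE TABLE statement that
# lists numeric columns under DICTIONARY_INCLUDE so they become dimensions.
# Table and column names below are invented for illustration only.
def build_create_table(table, columns, dictionary_include):
    """Return the DDL string; columns is a list of (name, type) pairs."""
    known = dict(columns)
    unknown = [c for c in dictionary_include if c not in known]
    if unknown:
        raise ValueError(f"not in column list: {unknown}")
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    props = f"'DICTIONARY_INCLUDE'='{','.join(dictionary_include)}'"
    return (
        f"CREATE TABLE IF NOT EXISTS {table} ({cols}) "
        f"STORED BY 'carbondata' "
        f"TBLPROPERTIES ({props})"
    )


ddl = build_create_table(
    "sales",
    [("id", "INT"), ("name", "STRING"), ("amount", "DOUBLE")],
    ["id"],  # the INT column we want treated as a dimension
)
print(ddl)
```

The resulting string would then be run through sql(...) on a
CarbonData-enabled Spark session.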


        It would be good to understand the above concepts in detail so
that we can use CarbonData effectively.


Thanks
Swapnil
