[jira] [Created] (CARBONDATA-1805) Optimize pruning for dictionary loading

xuchuanyin (JIRA) Thu, 23 Nov 2017 22:30:29 -0800

xuchuanyin created CARBONDATA-1805:
--------------------------------------

             Summary: Optimize pruning for dictionary loading
                 Key: CARBONDATA-1805
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1805
             Project: CarbonData
          Issue Type: Improvement
          Components: data-load, spark-integration
            Reporter: xuchuanyin
            Assignee: xuchuanyin
             Fix For: 1.3.0



# SCENARIO

Recently I tried the dictionary feature in CarbonData and found that the 
dictionary generation phase of data loading is quite slow. My scenario is as follows:

+ Input Data: 35.8GB CSV file with 199 columns and 126 million lines

+ Dictionary columns: 3 columns containing 19213, 4, and 9 distinct values respectively

The whole data load takes about 2.9min for dictionary generation and 4.6min 
for fact data loading, so about 39% of the total time is spent on the dictionary.

From the nmon results, I found that CPU usage was quite high during the 
dictionary generation phase, while disk and network usage were normal.

# ANALYSIS

After going through the dictionary generation code, I found that CarbonData 
already prunes non-dictionary columns before generating the dictionary. The 
problem is that `the pruning comes after data file reading`, so all columns of 
every line are still read first, which causes unnecessary overhead. We can 
optimize this by `pruning while reading the data file`.

# RESOLUTION

Refactor the `loadDataFrame` method in `GlobalDictionaryUtil` so that the 
non-dictionary columns are pruned while the data file is being read.
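
The idea of pruning during the read can be illustrated with a minimal sketch. It is not the CarbonData implementation: `read_pruned` and `distinct_values` are hypothetical helpers, and the column names and sample rows are made up. The reader projects each record down to the dictionary columns before handing it on, so later stages never receive the unused columns. This toy still tokenizes every field; it only shows the shape of the change.

```python
import csv
import io


def read_pruned(lines, keep_indexes, delimiter=","):
    """Yield only the fields at keep_indexes for each CSV record.

    Hypothetical helper: the projection happens inside the reader, so
    downstream stages never receive the non-dictionary columns.
    """
    for record in csv.reader(lines, delimiter=delimiter):
        yield tuple(record[i] for i in keep_indexes)


def distinct_values(lines, header, dict_columns):
    """Collect the distinct values of each dictionary column."""
    keep_indexes = [header.index(name) for name in dict_columns]
    distinct = {name: set() for name in dict_columns}
    for pruned in read_pruned(lines, keep_indexes):
        for name, value in zip(dict_columns, pruned):
            distinct[name].add(value)
    return distinct


# Toy input (made up): 5 columns, of which only 2 are dictionary columns.
header = ["id", "city", "amount", "gender", "note"]
data = io.StringIO(
    "1,shenzhen,10.5,M,a\n"
    "2,beijing,3.0,F,b\n"
    "3,shenzhen,7.2,F,c\n"
)
result = distinct_values(data, header, ["city", "gender"])
```

Here `result` maps each dictionary column to its set of distinct values, which is the input a dictionary generation step would need.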

After implementing this optimization, dictionary generation takes only 
`29s`, `about 6 times faster than before` (2.9min). Fact data loading takes 
the same time as before (4.6min), so about 10% of the total time is now spent 
on the dictionary.
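
The ratios quoted in this issue can be rechecked from the reported timings. The snippet below uses only the numbers given here (2.9min, 29s, 4.6min); the variable names are illustrative.

```python
# Timings reported in this issue, converted to seconds.
dict_before = 2.9 * 60   # dictionary generation before the change
dict_after = 29.0        # dictionary generation after the change
fact_load = 4.6 * 60     # fact data loading, unchanged

# Speedup of the dictionary generation phase.
speedup = dict_before / dict_after

# Share of the total load time spent on dictionary generation.
share_before = dict_before / (dict_before + fact_load)
share_after = dict_after / (dict_after + fact_load)

print(f"speedup: {speedup:.1f}x")
print(f"dictionary share before: {share_before:.1%}")
print(f"dictionary share after: {share_after:.1%}")
```

This gives a speedup of about 6.0x and dictionary shares of about 38.7% before and 9.5% after, consistent with the rounded figures of 39% and 10%.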

# NOTE

+ Currently only `load data file` benefits from this optimization; 
`load data frame` does not.

+ Before implementing this solution, I tried another one: caching the 
dataframe of the data file. The performance was even worse, with a dictionary 
generation time of 5.6min.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
