[jira] [Updated] (CARBONDATA-2091) Enhance data loading performance by specifying range bounds for sort columns

Indhumathi Muthumurugesh (Jira) Sun, 22 Nov 2020 22:02:06 -0800


     [ https://issues.apache.org/jira/browse/CARBONDATA-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Indhumathi Muthumurugesh updated CARBONDATA-2091:
-------------------------------------------------
    Fix Version/s:     (was: 2.0.2)
                   2.0.1

> Enhance data loading performance by specifying range bounds for sort columns
> ----------------------------------------------------------------------------
>
>                 Key: CARBONDATA-2091
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2091
>             Project: CarbonData
>          Issue Type: Improvement
>            Reporter: Chuanyin Xu
>            Assignee: Chuanyin Xu
>            Priority: Major
>             Fix For: 2.0.1
>
>          Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Currently in carbondata, data loading using node_sort (also known as
> local_sort) consists of the following steps:
>  # Convert the input data in batches. (*Convert*)
>  # Sort each batch and write it to a sort temp file. (*TempSort*)
>  # Combine sort temp files with a merge sort to get bigger ordered sort
> temp files. (*MergeSort*)
>  # Combine all the sort temp files in a final sort whose results feed
> the next step. (*FinalSort*)
>  # Get the rows in order and convert them to carbondata columnar format
> pages. (*Produce*)
>  # Write bundles of pages to files and write the corresponding index file.
> (*Consume*)
> Steps 1 to 3 run concurrently on multiple threads. Step 4 runs on only one
> thread. Step 5 runs on multiple threads. Step 4 is therefore the bottleneck
> of the whole procedure: when observing data loading performance, we can see
> that CPU usage is low after Step 3.
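
To make the bottleneck concrete, here is a minimal Python sketch of the TempSort and FinalSort steps. It is not CarbonData code: the function names, the in-memory lists standing in for sort temp files, and the integer sort key are all assumptions made for illustration. The point is that the final k-way merge consumes every sorted run in a single sequential pass.

```python
import heapq
from typing import Iterable, List


def temp_sort(rows: Iterable[int], batch_size: int) -> List[List[int]]:
    """Split the input into batches and sort each one (Convert + TempSort).

    Each returned list stands in for one sort temp file (an assumption;
    the real loader writes these runs to disk).
    """
    batch: List[int] = []
    runs: List[List[int]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            runs.append(sorted(batch))
            batch = []
    if batch:
        # Flush the last, possibly shorter, batch.
        runs.append(sorted(batch))
    return runs


def final_sort(runs: List[List[int]]) -> List[int]:
    """Merge all sorted runs in one sequential k-way merge (FinalSort)."""
    return list(heapq.merge(*runs))


rows = [42, 7, 19, 3, 88, 61, 5, 30, 14]
runs = temp_sort(rows, batch_size=4)
merged = final_sort(runs)
print(runs)
print(merged)
```

The batches in `temp_sort` are independent and could be sorted on separate threads, whereas `final_sort` yields one globally ordered stream and so runs on one thread in this sketch.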
>  
> We can enhance data loading performance by parallelizing Step 4.
>  
> Users can specify range bounds for the sort columns. Carbondata then
> distributes the records to the different ranges internally and processes
> the ranges concurrently.
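
The proposal can be sketched in Python as follows. This is an illustration under stated assumptions, not the CarbonData implementation: `assign_range` and `range_sort` are hypothetical names, the sort key is a plain integer, and the boundary rule (a value equal to a bound goes to the higher range) is chosen only for the example. Because the ranges do not overlap, each one can be sorted independently, and concatenating the sorted ranges in order gives a globally sorted result without a single-threaded final merge.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def assign_range(value: int, bounds: Sequence[int]) -> int:
    """Return the index of the range that `value` falls into.

    For ascending bounds [b0, b1, ...], range 0 holds values < b0,
    range 1 holds b0 <= value < b1, and the last range holds values
    >= the last bound (assumed boundary rule).
    """
    return bisect_right(bounds, value)


def range_sort(rows: Sequence[int], bounds: Sequence[int]) -> List[List[int]]:
    """Distribute rows into ranges, then sort every range concurrently."""
    buckets: List[List[int]] = [[] for _ in range(len(bounds) + 1)]
    for row in rows:
        buckets[assign_range(row, bounds)].append(row)
    # One worker per range; pool.map preserves the order of the ranges.
    with ThreadPoolExecutor(max_workers=len(buckets)) as pool:
        return list(pool.map(sorted, buckets))


rows = [42, 7, 19, 3, 88, 61, 5, 30, 14]
bounds = [20, 50]  # user-specified range bounds (hypothetical values)
sorted_ranges = range_sort(rows, bounds)
flattened = [row for bucket in sorted_ranges for row in bucket]
print(sorted_ranges)
print(flattened)
```

The thread pool here only shows the structure of the idea; in CPython, threads do not speed up CPU-bound sorting. How evenly the work is spread depends on how well the bounds match the data distribution, since a skewed choice leaves most rows in one range.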


