[ https://issues.apache.org/jira/browse/CARBONDATA-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Indhumathi Muthumurugesh updated CARBONDATA-2091:
-------------------------------------------------
    Fix Version/s:     (was: 2.0.2)
                   2.0.1

> Enhance data loading performance by specifying range bounds for sort columns
> ----------------------------------------------------------------------------
>
>                 Key: CARBONDATA-2091
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-2091
>             Project: CarbonData
>          Issue Type: Improvement
>            Reporter: Chuanyin Xu
>            Assignee: Chuanyin Xu
>            Priority: Major
>             Fix For: 2.0.1
>
>          Time Spent: 8h 40m
>  Remaining Estimate: 0h
>
> Currently in CarbonData, data loading with node_sort (also known as
> local_sort) consists of the following steps:
> # Convert the input data in batches. (*Convert*)
> # Sort each batch and write it to sort temp files. (*TempSort*)
> # Combine the sort temp files and merge-sort them into a bigger ordered sort
> temp file. (*MergeSort*)
> # Combine all the sort temp files and do a final sort, whose results feed
> the next step. (*FinalSort*)
> # Get the rows in order and convert them to CarbonData columnar format pages.
> (*produce*)
> # Write bundles of pages to files and write the corresponding index file.
> (*consume*)
> Steps 1 to 3 run concurrently on multiple threads. Step 4 runs on a single
> thread. Step 5 runs on multiple threads. Step 4 is therefore the bottleneck
> of the whole procedure: when observing data loading performance, we can see
> that CPU usage is low after Step 3.
>
> We can enhance the data loading performance by parallelizing Step 4.
>
> Users can specify range bounds for the sort columns. CarbonData then
> internally distributes the records to the different ranges and processes
> the ranges concurrently.
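
The idea of parallelizing the final sort with user-specified range bounds can be sketched as follows. This is only a minimal illustration and not CarbonData's implementation: the class and method names (RangeBoundSortSketch, rangeOf, sortByRanges) are hypothetical, the sort column is assumed to be a single int key, the bounds are assumed to be strictly ascending, and an in-memory sort stands in for the merge of sort temp files. Because every key in range i is no greater than every key in range i+1, the ranges can be sorted independently and simply concatenated.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from CarbonData.
class RangeBoundSortSketch {

    // Returns the index of the range a key falls into.
    // With bounds {b0, b1, ...} (strictly ascending), range 0 holds keys <= b0,
    // range i holds keys in (b[i-1], b[i]], and the last range holds keys > b[last].
    static int rangeOf(int key, int[] bounds) {
        int pos = Arrays.binarySearch(bounds, key);
        // A negative result encodes the insertion point as -(insertionPoint) - 1.
        return pos >= 0 ? pos : -(pos + 1);
    }

    // Distributes keys to ranges, sorts every range on its own thread,
    // then concatenates the ranges in order to get a globally sorted list.
    static List<Integer> sortByRanges(List<Integer> keys, int[] bounds) {
        List<List<Integer>> ranges = new ArrayList<>();
        for (int i = 0; i <= bounds.length; i++) {
            ranges.add(new ArrayList<>());
        }
        for (int key : keys) {
            ranges.get(rangeOf(key, bounds)).add(key);
        }

        ExecutorService pool = Executors.newFixedThreadPool(ranges.size());
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (List<Integer> range : ranges) {
                // Each range is sorted independently of the others.
                pending.add(pool.submit(() -> Collections.sort(range)));
            }
            for (Future<?> task : pending) {
                task.get(); // wait for completion and surface any failure
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("range sort failed", e);
        } finally {
            pool.shutdown();
        }

        List<Integer> result = new ArrayList<>();
        for (List<Integer> range : ranges) {
            result.addAll(range);
        }
        return result;
    }

    public static void main(String[] args) {
        int[] bounds = {10, 30}; // two bounds define three ranges
        List<Integer> keys = Arrays.asList(42, 7, 19, 88, 3, 25);
        System.out.println(sortByRanges(keys, bounds)); // prints [3, 7, 19, 25, 42, 88]
    }
}
```

How well this scales depends on the bounds: if most records fall into one range, that range becomes the new single-threaded bottleneck, so the bounds should split the data roughly evenly.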