[ 
https://issues.apache.org/jira/browse/CARBONDATA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ravindra Pesala reassigned CARBONDATA-742:
------------------------------------------

    Assignee: Ravindra Pesala

> Add batch sort to improve the loading performance
> -------------------------------------------------
>
>                 Key: CARBONDATA-742
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-742
>             Project: CarbonData
>          Issue Type: Improvement
>            Reporter: Ravindra Pesala
>            Assignee: Ravindra Pesala
>
> Current Problem:
> The sort step is a major bottleneck because it is a blocking step. It must 
> receive all the data and write the sort temp files to disk; only after that 
> can the data writer step start.
> Solution:
> Make the sort step non-blocking so that the data writer step does not have 
> to wait. Process the data in the sort step in batches sized to the in-memory 
> capacity of the machine. For example, if the machine can allocate 4 GB to 
> process data in memory, the sort step can sort the data in batches of 2 GB 
> and hand each batch to the data writer step. While the data writer step 
> consumes one batch, the sort step receives and sorts the next. All steps 
> therefore work continuously, and there is no disk IO in the sort step.
> The data writer step never waits for the whole sort to finish: as soon as 
> the sort step has sorted a batch in memory, the data writer can start 
> writing it.
> This can significantly improve the loading performance.
> Advantages:
> Increases the loading performance, as there is no intermediate IO and the 
> sort step no longer blocks.
> No extra effort is needed for compaction; the current flow can handle it.
> Disadvantages:
> The number of driver-side btrees will increase, so memory usage might 
> increase, but it could be controlled by the current LRU cache implementation.
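
The batch sort idea in the Solution section can be illustrated with a small producer/consumer sketch. This is not CarbonData code: the names batch_sort_load, handoff and batch_size are hypothetical, batches are counted in rows rather than bytes, and a Python list stands in for the data writer's output. It only shows the hand-off pattern, where the sort step passes each sorted in-memory batch to the writer and immediately starts on the next one.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the input


def batch_sort_load(rows, batch_size, key=None):
    """Sort `rows` in fixed-size in-memory batches and hand each sorted
    batch to a writer thread as soon as it is ready.

    Returns the sorted batches in the order the writer received them.
    Hypothetical helper for illustration only, not a CarbonData API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # A bounded queue keeps the sort step from running arbitrarily far
    # ahead of the writer, which caps how many batches sit in memory.
    handoff = queue.Queue(maxsize=1)
    written = []

    def sort_step():
        try:
            batch = []
            for row in rows:
                batch.append(row)
                if len(batch) == batch_size:
                    batch.sort(key=key)
                    handoff.put(batch)  # blocks only if the writer is behind
                    batch = []
            if batch:  # final, possibly smaller batch
                batch.sort(key=key)
                handoff.put(batch)
        finally:
            handoff.put(_DONE)  # always let the writer terminate

    def writer_step():
        while True:
            batch = handoff.get()
            if batch is _DONE:
                break
            written.append(batch)  # stand-in for writing a data file

    sorter = threading.Thread(target=sort_step)
    writer = threading.Thread(target=writer_step)
    sorter.start()
    writer.start()
    sorter.join()
    writer.join()
    return written
```

Note that each batch is sorted only within itself, so the result is a series of independently sorted runs rather than one globally sorted output. That appears consistent with the disadvantage listed in the issue, where more sorted units mean more driver-side btrees.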



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
