[ 
https://issues.apache.org/jira/browse/DRILL-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Phillips updated DRILL-386:
----------------------------------

    Attachment: DRILL-386.patch

Rebased on current master.

> Implement External Sort operator
> --------------------------------
>
>                 Key: DRILL-386
>                 URL: https://issues.apache.org/jira/browse/DRILL-386
>             Project: Apache Drill
>          Issue Type: New Feature
>            Reporter: Steven Phillips
>            Assignee: Steven Phillips
>         Attachments: DRILL-386.patch, DRILL-386.patch, DRILL-386.patch
>
>
> This operator will allow sorting data that is larger than the size of memory 
> by spilling to disk when necessary.
> Also as part of this jira, I will implement a new merge sort algorithm that 
> will hopefully better utilize cluster resources than our current sort, which 
> is based on Quicksort. The problem with quicksort is that we can't do any 
> sorting until all of the batches have arrived. But this will often result in 
> very low CPU utilization while the data is read from disk, followed by a 
> period of very high CPU usage.
> The external sort will include sorting each batch individually when it 
> arrives. In the case where no spills occur, it makes sense to take advantage 
> of the fact that the batches are already sorted, but doing the an N-way 
> merge, as is done when there are spills, is not the most effective way to do 
> this, (according to some tests I have done). What will be done in this case, 
> rather than an N-way merge using a heap, we will do a variation of natural 
> merge sort, which amounts to essentially a staged, 2-way merge of the 
> incoming (sorted) batches.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to