[ https://issues.apache.org/jira/browse/DRILL-386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Phillips updated DRILL-386: ---------------------------------- Attachment: DRILL-386.patch Rebased on current master. > Implement External Sort operator > -------------------------------- > > Key: DRILL-386 > URL: https://issues.apache.org/jira/browse/DRILL-386 > Project: Apache Drill > Issue Type: New Feature > Reporter: Steven Phillips > Assignee: Steven Phillips > Attachments: DRILL-386.patch, DRILL-386.patch, DRILL-386.patch > > > This operator will allow sorting data that is larger than the size of memory > by spilling to disk when necessary. > Also as part of this jira, I will implement a new merge sort algorithm that > will hopefully better utilize cluster resources than our current sort, which > is based on Quicksort. The problem with quicksort is that we can't do any > sorting until all of the batches have arrived. But this will often result in > very low CPU utilization while the data is read from disk, followed by a > period of very high CPU usage. > The external sort will include sorting each batch individually when it > arrives. In the case where no spills occur, it makes sense to take advantage > of the fact that the batches are already sorted, but doing the an N-way > merge, as is done when there are spills, is not the most effective way to do > this, (according to some tests I have done). What will be done in this case, > rather than an N-way merge using a heap, we will do a variation of natural > merge sort, which amounts to essentially a staged, 2-way merge of the > incoming (sorted) batches. -- This message was sent by Atlassian JIRA (v6.2#6252)