[ 
https://issues.apache.org/jira/browse/TAJO-36?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890465#comment-13890465
 ] 

Hyunsik Choi commented on TAJO-36:
----------------------------------

This patch completely rewrites ExternalSortExec which is a physical operator 
for sort. It has the following improvements:
 * in-memory sort if input data fits main-memory (given sort buffer)
 * it works as n-way external merge sort operator if input data exceeds once a 
sort buffer size.
 * It performs k-way merge and it can be parallelized depending on the config 
(tajo.executor.external-sort.thread-num).
 * It performs unbalanced merge if possible. (if you need more information, 
please refer a paper 'Query evaluation techniques for large databases' or a 
polyphase sort.
 * It uses worker temporal directories in a round-robin manner.
 * It omits the final merge step because the final result is used as an input 
data of the subsequent physical operators. So, we don't need to merge the 
remain input files into only one file.

You can adjust the following configs, and later ExternalSortExec should be able 
to change the following configs according tothe allocated container's resource.
{code:title=TajoConf.java}
 EXECUTOR_EXTERNAL_SORT_THREAD_NUM("tajo.executor.external-sort.thread-num", 1),
 EXECUTOR_EXTERNAL_SORT_BUFFER_SIZE("tajo.executor.external-sort.buffer-mb", 
200),
 EXECUTOR_EXTERNAL_SORT_FANOUT("tajo.executor.external-sort.fanout-num", 8),
{code}

> Improve ExternalSortExec with N-merge sort and final pass omission
> ------------------------------------------------------------------
>
>                 Key: TAJO-36
>                 URL: https://issues.apache.org/jira/browse/TAJO-36
>             Project: Tajo
>          Issue Type: Improvement
>          Components: physical operator
>            Reporter: Hyunsik Choi
>            Assignee: Hyunsik Choi
>            Priority: Critical
>             Fix For: 0.8-incubating
>
>         Attachments: TAJO-36.patch
>
>
> Background:
> The current ExternalSortExec just uses the binary external merge sort 
> algorithm 
> (http://en.wikipedia.org/wiki/External_sorting#External_merge_sort). In other 
> words, for each pass, ExternalSortExec just merges two files into one sorted 
> file.
> Proposal:
> The goal of this proposal is to improve ExternalSortExec with the following 
> improvements:
> * N-merge sort - we can merge N files though more memory at each pass. It 
> will reduce the number of passes. Consequently, it will reduces considerable 
> I/O overheads.
> * the final pass omission - a physical operator is pipelined by the parent 
> operator. The final pass of the merge sort must also be invoked by the parent 
> physical operator. So, we can omit the final pass of the merge sort.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to