[jira] [Updated] (TAJO-584) Improve distributed merge sort

Hyunsik Choi (JIRA) Tue, 04 Feb 2014 13:53:07 -0800

     [ https://issues.apache.org/jira/browse/TAJO-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyunsik Choi updated TAJO-584:
------------------------------

    Description: 
In Tajo, the sort operator is similar to merge sort and works in a distributed 
manner. The first sort phase sorts each fragment on its local machine, the 
intermediate data are shuffled by range partitioning, and then the second sort 
phase on each node sorts the range-partitioned data.
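
As a rough illustration of the two phases described above (a toy, 
single-process sketch, not Tajo code; the function names, the sample fragments, 
and the range boundaries are all made up for this example):

```python
from bisect import bisect_right

def local_sort(fragments):
    # Phase 1: every fragment is sorted independently (one per machine).
    return [sorted(fragment) for fragment in fragments]

def range_shuffle(sorted_runs, boundaries):
    # Shuffle: partition i receives the keys that fall into the i-th key range.
    # Each piece cut out of a sorted run is itself still sorted.
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for run in sorted_runs:
        pieces = [[] for _ in partitions]
        for key in run:
            pieces[bisect_right(boundaries, key)].append(key)
        for partition, piece in zip(partitions, pieces):
            if piece:
                partition.append(piece)
    return partitions

def second_phase(partitions):
    # Phase 2: each node sorts the keys of its own range partition.
    return [sorted(key for piece in pieces for key in piece)
            for pieces in partitions]

# Hypothetical input: three fragments and two range boundaries (assumptions).
fragments = [[9, 1, 5], [4, 8, 2], [7, 3, 6]]
boundaries = [3, 6]

runs = local_sort(fragments)
partitions = range_shuffle(runs, boundaries)
# Concatenating the partitions in range order yields the globally sorted result.
result = [key for part in second_phase(partitions) for key in part]
```

Because the partitions cover disjoint, ordered key ranges, no further merging 
across nodes is needed once every partition is sorted.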

However, the second sort phase reads all shuffled data via one scanner, so it 
misses the opportunity to exploit already-sorted data. This patch improves the 
second sort phase to directly merge multiple already-sorted intermediate data 
sets. It significantly reduces the response time of sort queries.
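
The difference between the two approaches can be sketched as follows. This is 
only a minimal illustration in Python and not the code in TAJO-584.patch; the 
function names and the sample runs are assumptions made for the example. 
Merging k sorted runs with n keys in total takes O(n log k) comparisons, while 
a general comparison sort over the concatenated data takes O(n log n).

```python
import heapq

def resort_via_single_scan(sorted_runs):
    # Old behaviour (sketch): read every run through one scan and sort the
    # concatenated data again, ignoring that each run is already sorted.
    return sorted(key for run in sorted_runs for key in run)

def merge_sorted_runs(sorted_runs):
    # Improved behaviour (sketch): k-way merge of the already-sorted runs.
    # heapq.merge keeps only one pending key per run in its heap.
    return list(heapq.merge(*sorted_runs))

# Hypothetical intermediate data: three runs, each sorted by an earlier phase.
runs = [[1, 4, 9], [2, 3, 10], [5, 6, 7, 8]]
merged = merge_sorted_runs(runs)
```

Both functions return the same ordering; the merge simply does less work 
because it relies on the order established by the first sort phase.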

I carried out a simple benchmark with the following query on the TPC-H 100GB 
data set:
{code:sql}
select l_orderkey from lineitem order by l_orderkey;
{code}

The lineitem table occupies 75GB. The query response time was reduced from 480 
to 260 secs, about 46% less. This patch exploits the design of TAJO-36, so it 
requires TAJO-36.

  was:
In Tajo, sort operator is similar to merge sort, but it works in a distributed 
manner. The first sort phase sorts each fragment in local machine, the 
intermediate data are shuffled in range partition, and then the the second sort 
phase in each node sorts the range-partitioned data.

However, the second sort phase reads all shuffled data via one scanner. It 
causes performance degrade. This patch improves the second sort phase to merge 
directly all already-sorted intermediate data. It significantly reduces the 
response time of sort queries.

I carried out some simple benchmark with the following query on TPC-H 100GB 
data sets:
{code:sql}
select l_orderkey from lineitem order by l_orderkey;
{code}

The lineitem table occupies 75GB. The query response time are dramatically 
reduced from 480 to 260 secs. This patch exploits the design of TAJO-36. So, 
this patch requires TAJO-36.


> Improve distributed merge sort
> ------------------------------
>
>                 Key: TAJO-584
>                 URL: https://issues.apache.org/jira/browse/TAJO-584
>             Project: Tajo
>          Issue Type: Improvement
>          Components: distributed query plan, physical operator
>            Reporter: Hyunsik Choi
>            Assignee: Hyunsik Choi
>             Fix For: 0.8-incubating
>
>         Attachments: TAJO-584.patch



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
