[jira] [Updated] (TAJO-601) Improve distinct aggregation query processing

Hyunsik Choi (JIRA) Tue, 18 Feb 2014 04:03:13 -0800

     [ https://issues.apache.org/jira/browse/TAJO-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyunsik Choi updated TAJO-601:
------------------------------

    Attachment: TAJO-601.patch

> Improve distinct aggregation query processing
> ---------------------------------------------
>
>                 Key: TAJO-601
>                 URL: https://issues.apache.org/jira/browse/TAJO-601
>             Project: Tajo
>          Issue Type: Improvement
>          Components: planner/optimizer
>            Reporter: Hyunsik Choi
>            Assignee: Hyunsik Choi
>             Fix For: 0.8-incubating
>
>         Attachments: TAJO-601.patch
>
>
> Currently, distinct aggregation queries are executed in two stages:
> * First stage: shuffles tuples by hashing the grouping keys.
> * Second stage: sorts the shuffled tuples and executes sort aggregation.
> This approach executes queries that include distinct aggregation functions in 
> only two stages. However, because the first stage shuffles tuples without 
> aggregating them, it produces large intermediate data during the shuffle phase.
> This kind of query can be rewritten as an outer query over an aggregating subquery:
> {code:title=original query}
> SELECT grp1, grp2, count(*) as total, count(distinct grp3) as distinct_col 
> from rel1 group by grp1, grp2;
> {code}
> {code:title=rewritten query}
> SELECT grp1, grp2, sum(cnt) as total, count(grp3) as distinct_col from (
>   SELECT grp1, grp2, grp3, count(*) as cnt from rel1 group by grp1, grp2, grp3
> ) tmp1 group by grp1, grp2;
> {code}
> I expect this rewrite to significantly reduce the intermediate data volume 
> and the query response time in most cases.
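
The equivalence of the two queries can be checked on a small data set. The following is a minimal sketch, not Tajo code: it runs both queries in SQLite through Python's standard `sqlite3` module, with the rewritten query's parentheses balanced so that the subquery closes before the outer GROUP BY. The sample rows, the column types, the added ORDER BY clauses, and the helper name `run_queries` are assumptions made for illustration only. It also reports how many rows the inner aggregation emits, which is the data the outer query would have to consume.

```python
import sqlite3

# Hypothetical sample rows for rel1 (grp1, grp2, grp3); not from the issue.
ROWS = [
    ("a", "x", 1),
    ("a", "x", 1),
    ("a", "x", 2),
    ("a", "y", 3),
    ("b", "x", 3),
    ("b", "x", 3),
    ("b", "x", None),  # NULL is ignored by count(distinct grp3) and count(grp3)
]

# Original query from the issue; ORDER BY added only to make results comparable.
ORIGINAL = """
SELECT grp1, grp2, count(*) AS total, count(DISTINCT grp3) AS distinct_col
FROM rel1 GROUP BY grp1, grp2 ORDER BY grp1, grp2
"""

# Inner aggregation: one row per (grp1, grp2, grp3) combination.
INNER = """
SELECT grp1, grp2, grp3, count(*) AS cnt
FROM rel1 GROUP BY grp1, grp2, grp3
"""

# Rewritten query: sums the partial counts and counts the grp3 values,
# which are already unique within each (grp1, grp2) group.
REWRITTEN = f"""
SELECT grp1, grp2, sum(cnt) AS total, count(grp3) AS distinct_col
FROM ({INNER}) tmp1 GROUP BY grp1, grp2 ORDER BY grp1, grp2
"""


def run_queries(rows=ROWS):
    """Return (original result, rewritten result, inner row count)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rel1 (grp1 TEXT, grp2 TEXT, grp3 INTEGER)")
    conn.executemany("INSERT INTO rel1 VALUES (?, ?, ?)", rows)
    original = conn.execute(ORIGINAL).fetchall()
    rewritten = conn.execute(REWRITTEN).fetchall()
    inner_rows = conn.execute(f"SELECT count(*) FROM ({INNER})").fetchone()[0]
    conn.close()
    return original, rewritten, inner_rows


if __name__ == "__main__":
    original, rewritten, inner_rows = run_queries()
    print("original :", original)
    print("rewritten:", rewritten)
    print("input rows:", len(ROWS), "inner aggregation rows:", inner_rows)
```

On data with many repeated (grp1, grp2, grp3) combinations the inner aggregation emits far fewer rows than the input, which is the reduction the issue is counting on. When nearly every combination is unique, the inner result is about as large as the input and the rewrite gains little.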



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
