[ https://issues.apache.org/jira/browse/TAJO-601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyunsik Choi updated TAJO-601:
------------------------------
Attachment: TAJO-601.patch
> Improve distinct aggregation query processing
> ---------------------------------------------
>
> Key: TAJO-601
> URL: https://issues.apache.org/jira/browse/TAJO-601
> Project: Tajo
> Issue Type: Improvement
> Components: planner/optimizer
> Reporter: Hyunsik Choi
> Assignee: Hyunsik Choi
> Fix For: 0.8-incubating
>
> Attachments: TAJO-601.patch
>
>
> Currently, distinct aggregation queries are executed as follows:
> * the first stage: it just shuffles tuples by hashing the grouping keys.
> * the second stage: it sorts them and executes sort aggregation.
> This approach executes queries that include distinct aggregation functions in
> only two stages, but it produces large intermediate data during the shuffle phase.
> This kind of query can be rewritten as two nested queries:
> {code:title=original query}
> SELECT grp1, grp2, count(*) as total, count(distinct grp3) as distinct_col
> from rel1 group by grp1, grp2;
> {code}
> {code:title=rewritten query}
> SELECT grp1, grp2, sum(cnt) as total, count(grp3) as distinct_col from (
> SELECT grp1, grp2, grp3, count(*) as cnt from rel1 group by grp1, grp2,
> grp3) tmp1 group by grp1, grp2;
> {code}
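One way to sanity-check the rewrite is to run both forms on a small table. The sketch below uses Python's built-in sqlite3 module rather than Tajo, and the sample rows for rel1 are made up for illustration. The rewritten form is written with a single derived table, tmp1. It only shows that the two forms agree on this sample, including a row where grp3 is NULL; it says nothing about how Tajo plans either query.

```python
import sqlite3

# Hypothetical sample rows for rel1 (grp1, grp2, grp3); not taken from the issue.
ROWS = [
    ("a", 1, "x"),
    ("a", 1, "x"),
    ("a", 1, "y"),
    ("a", 2, "x"),
    ("b", 1, None),  # NULL grp3: counted by count(*), skipped by count(distinct grp3)
    ("b", 1, "z"),
    ("b", 1, "z"),
]

ORIGINAL = """
SELECT grp1, grp2, count(*) AS total, count(DISTINCT grp3) AS distinct_col
FROM rel1 GROUP BY grp1, grp2 ORDER BY grp1, grp2
"""

REWRITTEN = """
SELECT grp1, grp2, sum(cnt) AS total, count(grp3) AS distinct_col FROM (
  SELECT grp1, grp2, grp3, count(*) AS cnt FROM rel1 GROUP BY grp1, grp2, grp3
) tmp1 GROUP BY grp1, grp2 ORDER BY grp1, grp2
"""


def run_both(rows):
    """Load rows into an in-memory table and return both result sets."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE rel1 (grp1 TEXT, grp2 INTEGER, grp3 TEXT)")
    conn.executemany("INSERT INTO rel1 VALUES (?, ?, ?)", rows)
    original = conn.execute(ORIGINAL).fetchall()
    rewritten = conn.execute(REWRITTEN).fetchall()
    conn.close()
    return original, rewritten


original, rewritten = run_both(ROWS)
assert original == rewritten
```

The inner query collapses duplicate (grp1, grp2, grp3) combinations into one row each, so the outer count(grp3) counts distinct values, and sum(cnt) restores the original count(*).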
> I'm expecting that this rewrite will significantly reduce the intermediate
> data volume and query response time in most cases.
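The expected reduction in intermediate data can be illustrated with a rough model. The sketch below is not Tajo's executor: the number of first-stage tasks, the synthetic data, and the assumption that the rewritten plan aggregates on (grp1, grp2, grp3) inside each task before shuffling are all assumptions made for illustration. It simply counts how many rows each plan would hand to the shuffle.

```python
import random
from collections import Counter


def shuffled_rows_current(partitions):
    # Current plan as described in the issue: the first stage forwards
    # every input tuple, hashed on the grouping keys.
    return sum(len(part) for part in partitions)


def shuffled_rows_rewritten(partitions):
    # Assumed rewritten plan: each task pre-aggregates on (grp1, grp2, grp3)
    # and forwards one (key, cnt) row per distinct key it has seen.
    return sum(len(Counter(part)) for part in partitions)


def make_partitions(num_rows, num_tasks, seed=0):
    # Synthetic rel1 with 10 x 10 x 20 possible (grp1, grp2, grp3) combinations,
    # split round-robin across hypothetical first-stage tasks.
    rng = random.Random(seed)
    rows = [
        (rng.randrange(10), rng.randrange(10), rng.randrange(20))
        for _ in range(num_rows)
    ]
    return [rows[i::num_tasks] for i in range(num_tasks)]


partitions = make_partitions(num_rows=100_000, num_tasks=4)
current = shuffled_rows_current(partitions)
rewritten = shuffled_rows_rewritten(partitions)
print(f"current plan shuffles {current} rows, rewritten plan shuffles {rewritten} rows")
```

In this model the rewritten plan can never shuffle more than the number of tasks times the number of distinct (grp1, grp2, grp3) combinations. The gain shrinks as grp3 approaches uniqueness per row, which fits the "in most cases" qualifier above.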