[ 
https://issues.apache.org/jira/browse/HIVE-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881597#action_12881597
 ] 

Joydeep Sen Sarma commented on HIVE-1018:
-----------------------------------------

interesting idea. in most of the queries i have written (over the course of 
the last few months - this has involved a *lot* of joins and group-bys) - either 
the aggregate expressions or the group by clause would have a combination of 
columns from all tables being joined. these would be fairly hard to optimize 
based on the ideas outlined here.

in most of the join+group-by cases i see - people are joining fact with 
dimension and then using at least some non-join columns of the dimension 
for grouping (typically along with some columns from fact). the join/grouping 
columns being equal/superset seems interesting - but i am not sure about 
practical applicability.

even in the cases mentioned - some alternate trivial but effective 
optimizations are available:
1. join key=grouping key - grouping operator should realize that data is 
already sorted/clustered by grouping key (because it was joined on the same 
key). in this case we don't need partial aggregates - but can generate full 
aggregates off the output of the join. no hash maps required.
2. join key = subset of grouping keys - in this case (for sort merge join) - we 
can sort on the grouping keys (doesn't hurt much) for doing the join and then 
apply strategy #1.
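
a minimal sketch of strategy #1, in python rather than hive operator code. it 
assumes the join output arrives as (grouping key, value) pairs with equal keys 
adjacent, which is what a sort merge join on that key would produce. the name 
stream_aggregate and the sample rows are made up for illustration:

```python
from itertools import groupby
from operator import itemgetter


def stream_aggregate(sorted_rows):
    """Yield (key, sum, count) for rows already clustered by key.

    sorted_rows: iterable of (key, value) pairs in which equal keys are
    adjacent. Only the running state of the current group is held at any
    time, so no hash map of partial aggregates is needed.
    """
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        total = 0
        count = 0
        for _, value in group:
            total += value
            count += 1
        yield key, total, count


# Hypothetical join output, already sorted by the join key.
joined = [("a", 1), ("a", 2), ("b", 5), ("c", 3), ("c", 4)]
result = list(stream_aggregate(joined))
```

for strategy #2 the same function applies if the key is the full tuple of 
grouping columns and the rows are sorted on that tuple with the join key as 
its prefix.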

> pushing down group-by before joins
> ----------------------------------
>
>                 Key: HIVE-1018
>                 URL: https://issues.apache.org/jira/browse/HIVE-1018
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>
> Queries with both Group-by and Joins are very common and they are expensive 
> operations. By default Hive evaluates Joins first then group-by. Sometimes it 
> is possible to rewrite queries to apply group-by (or map-side partial group 
> by) first before join. This will remove a lot of duplicated keys in joins and 
> alleviate skewness in join keys for this case. This rewrite should be 
> cost-based. Before we have the stats and the CB framework, we can give users 
> hints to do the rewrite. 
> A particular case is where the join keys are the same as the grouping keys, 
> or the grouping keys are a superset of the join keys (so that grouping won't 
> affect the result of joins). 
> Examples:
> -- Q1
> select A.key, B.key
> from A join B on (A.key=B.key)
> group by A.key, B.key;
> --Q2
> select distinct A.key, B.key
> from A join B on (A.key=B.key);
> --Q3, aggregation function is sum, count, min, max, (avg and median cannot be 
> handled).
> select A.key, sum(A.value), count(1), min(value), max(value)
> from A left semi join B on (A.key=B.key)
> group by A.key;
> -- Q4. grouping keys are a superset of join keys
> select distinct A.key, A.value
> from A join B on (A.key=B.key);
> In the case where the join keys are not a subset of grouping keys, we can introduce 
> a map-side partial grouping operator with the keys of the UNION of the join 
> and grouping keys, to remove unnecessary duplications. This should be 
> cost-based though. 
> Any thoughts and suggestions?
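
For the Q3 case in the quoted description, a toy in-memory check (Python, not Hive) can illustrate why grouping before the join is safe there. It assumes A is a list of (key, value) rows and B a list of keys; the helper names and sample data are hypothetical. A left semi join emits each A row at most once, so aggregating A first and then filtering by B's keys should give the same sum, count, min and max as joining first:

```python
def aggregate(rows):
    """Group (key, value) rows by key into (sum, count, min, max)."""
    acc = {}
    for key, value in rows:
        if key in acc:
            s, c, lo, hi = acc[key]
            acc[key] = (s + value, c + 1, min(lo, value), max(hi, value))
        else:
            acc[key] = (value, 1, value, value)
    return acc


def left_semi_join(a_rows, b_keys):
    """Keep each A row whose key appears in B, at most once per A row."""
    keys = set(b_keys)
    return [row for row in a_rows if row[0] in keys]


# Hypothetical sample data; B has a duplicated key on purpose.
A = [(1, 10), (1, 20), (2, 5), (3, 7), (3, 9)]
B = [1, 1, 3]

# Default plan: join first, then group.
join_then_group = aggregate(left_semi_join(A, B))

# Rewritten plan: group A first, then keep only keys present in B.
b_keys = set(B)
group_then_join = {k: v for k, v in aggregate(A).items() if k in b_keys}
```

With a plain inner join the duplicated key in B would multiply A's rows, so sum and count could not be pushed down unchanged; that case would need the partial aggregates to be combined with B's per-key row counts.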

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
