[ https://issues.apache.org/jira/browse/HIVE-1018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881758#action_12881758 ]
Ning Zhang commented on HIVE-1018: ---------------------------------- Good points Joy. It will be interesting to see what are the typical use cases you have combining join and GroupBy. Previous what in my mind here is to optimize away the very bad case of skewness in the join (many rows with the same join key). Since GroupBy eliminates the skewness, these rewrite rules push down GroupBy before JOIN for these special cases. What you have mentioned are definitely what we should optimize for these cases. The are helpful for the general cases (non-skewed join) as well. > pushing down group-by before joins > ---------------------------------- > > Key: HIVE-1018 > URL: https://issues.apache.org/jira/browse/HIVE-1018 > Project: Hadoop Hive > Issue Type: Improvement > Reporter: Ning Zhang > > Queries with both Group-by and Joins are very common and they are expensive > operations. By default Hive evalutes Joins first then group-by. Sometimes it > is possible to rewrite queries to apply group-by (or map-side partial group > by) first before join. This will remove a lot of duplicated keys in joins and > alleviate skewness in join keys for this case. This rewrite should be > cost-based. Before we have the stats and the CB framework, we can give users > hints to do the rewrite. > A particular case is where the join keys are the same as the grouping keys. > Or the group keys is a superset of the join keys (so that grouping won't > affect the result of joins). > Examples: > -- Q1 > select A.key, B.key > from A join B on (A.key=B.key) > group by A.key, B.key; > --Q2 > select distinct A.key, B.key > from A join B on (A.key=B.key); > --Q3, aggregation function is sum, count, min, max, (avg and median cannot be > handled). > selec A.key, sum(A.value), count(1), min(value), max(value) > from A left semi join B on (A.key=B.key) > group by A.key; > -- Q4. grouping keys is a superset of join keys > select distinct A.key, A.value > from A join B on (A.key=B.key) > In the case of join keys are not a subset of grouping keys, we can introduce > a map-side partial grouping operator with the keys of the UNION of the join > and grouping keys, to remove unnecessary duplications. This should be > cost-based though. > Any thoughts and suggestions? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.