[
https://issues.apache.org/jira/browse/HIVE-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yin Huai updated HIVE-4827:
---------------------------
Summary: Merge a Map-only task to its child task (was: Merge a Map-only
job to its following MapReduce job with multiple inputs)
> Merge a Map-only task to its child task
> ---------------------------------------
>
> Key: HIVE-4827
> URL: https://issues.apache.org/jira/browse/HIVE-4827
> Project: Hive
> Issue Type: Improvement
> Components: Query Processor
> Affects Versions: 0.12.0
> Reporter: Yin Huai
> Assignee: Yin Huai
> Attachments: HIVE-4827.1.patch, HIVE-4827.2.patch, HIVE-4827.3.patch,
> HIVE-4827.4.patch, HIVE-4827.5.patch, HIVE-4827.6.patch, HIVE-4827.7.patch
>
>
> When hive.optimize.mapjoin.mapreduce is on, CommonJoinResolver can attach a
> Map-only job (MapJoin) to its following MapReduce job. But this merge only
> happens when the MapReduce job has a single input. With Correlation Optimizer
> (HIVE-2206), it is possible that the MapReduce job can have multiple inputs
> (for multiple operation paths). It is desired to improve CommonJoinResolver
> to merge a Map-only job to the corresponding Map task of the MapReduce job.
> Example:
> {code:sql}
> set hive.optimize.correlation=true;
> set hive.auto.convert.join=true;
> set hive.optimize.mapjoin.mapreduce=true;
> SELECT tmp1.key, count(*)
> FROM (SELECT x1.key1 AS key
> FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
> GROUP BY x1.key1) tmp1
> JOIN (SELECT x2.key2 AS key
> FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key2 = y2.key2)
> GROUP BY x2.key2) tmp2
> ON (tmp1.key = tmp2.key)
> GROUP BY tmp1.key;
> {\code}
> In this query, join operations inside tmp1 and tmp2 will be converted to two
> MapJoins. With Correlation Optimizer, aggregations in tmp1, tmp2, and join of
> tmp1 and tmp2, and the last aggregation will be executed in the same
> MapReduce job (Reduce side). Since this MapReduce job has two inputs, right
> now, CommonJoinResolver cannot attach two MapJoins to the Map side of a
> MapReduce job.
> Another example:
> {code:sql}
> SELECT tmp1.key
> FROM (SELECT x1.key2 AS key
> FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
> UNION ALL
> SELECT x2.key2 AS key
> FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)) tmp1
> {\code}
> For this case, we will have three Map-only jobs (two for MapJoins and one for
> Union). It will be good to use a single Map-only job to execute this query.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira