[jira] [Created] (HIVE-4827) Merge a Map-only job to its following MapReduce job with multiple inputs

Yin Huai (JIRA) Mon, 08 Jul 2013 19:02:47 -0700

Yin Huai created HIVE-4827:
------------------------------

             Summary: Merge a Map-only job to its following MapReduce job with 
multiple inputs
                 Key: HIVE-4827
                 URL: https://issues.apache.org/jira/browse/HIVE-4827
             Project: Hive
          Issue Type: Improvement
            Reporter: Yin Huai
            Assignee: Yin Huai



When hive.optimize.mapjoin.mapreduce is on, CommonJoinResolver can attach a 
Map-only job (MapJoin) to the MapReduce job that follows it. However, this 
merge only happens when the MapReduce job has a single input. With the 
Correlation Optimizer (HIVE-2206), the MapReduce job can have multiple inputs 
(for multiple operation paths). CommonJoinResolver should be improved to merge 
each Map-only job into the corresponding Map task of the MapReduce job.

Example:
{code:sql}
SELECT tmp1.key, count(*)
FROM (SELECT x1.key2 AS key
      FROM bigTable1 x1 JOIN smallTable1 y1 ON (x1.key1 = y1.key1)
      GROUP BY x1.key2) tmp1
JOIN (SELECT x2.key2 AS key
      FROM bigTable2 x2 JOIN smallTable2 y2 ON (x2.key1 = y2.key1)
      GROUP BY x2.key2) tmp2
ON (tmp1.key = tmp2.key)
GROUP BY tmp1.key;
{code}
In this query, the join operations inside tmp1 and tmp2 will be converted to 
two MapJoins. With the Correlation Optimizer, the aggregations in tmp1 and 
tmp2, the join of tmp1 and tmp2, and the final aggregation will all be executed 
on the Reduce side of the same MapReduce job. Because this MapReduce job has 
two inputs, CommonJoinResolver currently cannot attach the two MapJoins to its 
Map side.
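
For reference, session settings along the following lines should put the query 
above into the described situation. This is only a sketch: 
hive.optimize.mapjoin.mapreduce is the flag named in this issue, while 
hive.auto.convert.join (automatic MapJoin conversion) and 
hive.optimize.correlation (Correlation Optimizer from HIVE-2206) are assumed 
here and may differ depending on the Hive version.
{code:sql}
-- Assumed: let Hive convert common joins with a small table into MapJoins.
SET hive.auto.convert.join=true;
-- Named in this issue: merge a Map-only job into its following MapReduce job.
SET hive.optimize.mapjoin.mapreduce=true;
-- Assumed: enable the Correlation Optimizer (HIVE-2206).
SET hive.optimize.correlation=true;
-- Prefixing the example query with EXPLAIN then shows the resulting stages.
{code}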

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
