[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Hudson (JIRA) Thu, 18 Jul 2013 04:41:18 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712239#comment-13712239
 ]


Hudson commented on HIVE-2206:
------------------------------

FAILURE: Integrated in Hive-trunk-h0.21 #2205 (See 
[https://builds.apache.org/job/Hive-trunk-h0.21/2205/])
HIVE-2206 [jira] add a new optimizer for query correlation discovery and 
optimization
(Yin Huai via Ashutosh Chauhan)

Summary:
update test results

This issue proposes a new logical optimizer called Correlation Optimizer, which 
is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The 
idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and 
slides of YSmart are linked at the bottom.

Since Hive translates queries in a sentence by sentence fashion, for every 
operation which may need to shuffle the data (e.g. join and aggregation 
operations), Hive will generate a MapReduce job for that operation. However, 
for those operations which may need to shuffle the data, they may involve 
correlations explained below and thus can be executed in a single MR job.

        Input Correlation: Multiple MR jobs have input correlation (IC) if 
their input relation sets are not disjoint;
        Transit Correlation: Multiple MR jobs have transit correlation (TC) if 
they have not only input correlation, but also the same partition key;
        Job Flow Correlation: An MR has job ﬂow correlation (JFC) with one of 
its child nodes if it has the same partition key as that child node.

The current implementation of correlation optimizer only detect correlations 
among MR jobs for reduce-side join operators and reduce-side aggregation 
operators (not map only aggregation). A query will be optimized if it satisfies 
following conditions.

        There exists a MR job for reduce-side join operator or reduce side 
aggregation operator which have JFC with all of its parents MR jobs (TCs will 
be also exploited if JFC exists);
        All input tables of those correlated MR job are original input tables 
(not intermediate tables generated by sub-queries); and
        No self join is involved in those correlated MR jobs.

Correlation optimizer is implemented as a logical optimizer. The main reasons 
are that it only needs to manipulate the query plan tree and it can leverage 
the existing component on generating MR jobs.

Current implementation can serve as a framework for correlation related 
optimizations. I think that it is better than adding individual optimizers.

There are several work that can be done in future to improve this optimizer. 
Here are three examples.

        Support queries only involve TC;
        Support queries in which input tables of correlated MR jobs involves 
intermediate tables; and
        Optimize queries involving self join.

References:
Paper and presentation of YSmart.
Paper: 
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

Test Plan: EMPTY

Reviewers: JIRA, ashutoshc

Reviewed By: ashutoshc

CC: brock

Differential Revision: https://reviews.facebook.net/D11097 (hashutosh: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1504395)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/conf/hive-default.xml.template
* /hive/trunk/ql/if/queryplan.thrift
* 
/hive/trunk/ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/CommonJoinOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/ExecReducer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMRUnion1.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkDeDuplication.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/AbstractCorrelationProcCtx.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationUtilities.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/IntraQueryCorrelation.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java
* 
/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/DemuxDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/MuxDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/UnionDesc.java
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer1.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer10.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer11.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer12.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer13.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer14.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer2.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer3.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer4.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer5.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer6.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer7.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer8.q
* /hive/trunk/ql/src/test/queries/clientpositive/correlationoptimizer9.q
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer1.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer10.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer11.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer12.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer13.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer14.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer2.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer3.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer4.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer5.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer6.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer7.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer8.q.out
* /hive/trunk/ql/src/test/results/clientpositive/correlationoptimizer9.q.out
* /hive/trunk/ql/src/test/results/compiler/plan/groupby2.q.xml
* /hive/trunk/ql/src/test/results/compiler/plan/groupby3.q.xml

                
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
>                 Key: HIVE-2206
>                 URL: https://issues.apache.org/jira/browse/HIVE-2206
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>    Affects Versions: 0.12.0
>            Reporter: He Yongqiang
>            Assignee: Yin Huai
>             Fix For: 0.12.0
>
>         Attachments: HIVE-2206.10-r1384442.patch.txt, 
> HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, 
> HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, 
> HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, 
> HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, 
> HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, 
> HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt, 
> HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, 
> HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, 
> HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, 
> HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch, 
> HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch, 
> HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch, 
> HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch, 
> HIVE-2206.D11097.18.patch, HIVE-2206.D11097.19.patch, 
> HIVE-2206.D11097.1.patch, HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, 
> HIVE-2206.D11097.4.patch, HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, 
> HIVE-2206.D11097.7.patch, HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, 
> HIVE-2206.patch, testQueries.2.q, YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called Correlation Optimizer, 
> which is used to merge correlated MapReduce jobs (MR jobs) into a single MR 
> job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The 
> paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence by sentence fashion, for every 
> operation which may need to shuffle the data (e.g. join and aggregation 
> operations), Hive will generate a MapReduce job for that operation. However, 
> for those operations which may need to shuffle the data, they may involve 
> correlations explained below and thus can be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their 
> input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they 
> have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR has job ﬂow correlation (JFC) with one of its 
> child nodes if it has the same partition key as that child node.
> The current implementation of correlation optimizer only detect correlations 
> among MR jobs for reduce-side join operators and reduce-side aggregation 
> operators (not map only aggregation). A query will be optimized if it 
> satisfies following conditions.
> # There exists a MR job for reduce-side join operator or reduce side 
> aggregation operator which have JFC with all of its parents MR jobs (TCs will 
> be also exploited if JFC exists);
> # All input tables of those correlated MR job are original input tables (not 
> intermediate tables generated by sub-queries); and 
> # No self join is involved in those correlated MR jobs.
> Correlation optimizer is implemented as a logical optimizer. The main reasons 
> are that it only needs to manipulate the query plan tree and it can leverage 
> the existing component on generating MR jobs.
> Current implementation can serve as a framework for correlation related 
> optimizations. I think that it is better than adding individual optimizers. 
> There are several work that can be done in future to improve this optimizer. 
> Here are three examples.
> # Support queries only involve TC;
> # Support queries in which input tables of correlated MR jobs involves 
> intermediate tables; and 
> # Optimize queries involving self join. 
> References:
> Paper and presentation of YSmart.
> Paper: 
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization

Reply via email to