[
https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710608#comment-13710608
]
Phabricator commented on HIVE-2206:
-----------------------------------
ashutoshc has commented on the revision "HIVE-2206 [jira] add a new optimizer
for query correlation discovery and optimization".
Few more comments. See which of these apply. If they doesn't apply, feel free
to ignore.
INLINE COMMENTS
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:250
What does this function do?
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:171
Will be good to add comment stating when table == null
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:153
It seems like lot of logic here is shared with CommonJoinTaskDispatcher. It
will be good to have that refactored so that its reusable here.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:284
Seems like this method always return true. So, this is not required.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:271
Add a comment saying that tree walking is done and now you will apply
transformations which you have detected.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:590
Do we really need hasBeenRemoved() check?
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:597
getKeyCols().size() is not a good check. I will recommend to test explictly
for operators which we are supporting right now.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:630
Do we still need this?
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:647
We should do jobFlowCorrelation as another pass in transform().
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:526
It will be good to add some ascii art which shows what tree structure we are
returning from this function.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:368
It will good to add javadoc for this method.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java:453
Didn't understand this comment. Probably we can word it better.
ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:61 I dont think
we need to revert to oldTag here. We can keep using newTag.
ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:59-60 Doesnt
look like you are using these OIs. Probably we can get rid of these.
ql/src/java/org/apache/hadoop/hive/ql/exec/DemuxOperator.java:174 It will be
good to add comments for whats the intent of this for loop.
ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:182 Why is it
called NumOriginalParents? can it be just numOfParents
ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:93 There is
already a forward in Demux, this should not be needed.
ql/src/java/org/apache/hadoop/hive/ql/exec/MuxOperator.java:75 You dont need
this constructor
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:83
Looks like this map is not used anymore, lets get rid of this.
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:79
It will be good to add comments about what this method is intending to do
ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/QueryPlanTreeTransformation.java:67
This method straight away calls another method. We can eliminate this one.
REVISION DETAIL
https://reviews.facebook.net/D11097
BRANCH
HIVE-2206-3671-20130711
ARCANIST PROJECT
hive
To: JIRA, ashutoshc, yhuai
Cc: brock
> add a new optimizer for query correlation discovery and optimization
> --------------------------------------------------------------------
>
> Key: HIVE-2206
> URL: https://issues.apache.org/jira/browse/HIVE-2206
> Project: Hive
> Issue Type: New Feature
> Components: Query Processor
> Affects Versions: 0.12.0
> Reporter: He Yongqiang
> Assignee: Yin Huai
> Attachments: HIVE-2206.10-r1384442.patch.txt,
> HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt,
> HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt,
> HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt,
> HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt,
> HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt,
> HIVE-2206.20-r1434012.patch.txt, HIVE-2206.2.patch.txt,
> HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt,
> HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt,
> HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt,
> HIVE-2206.D11097.10.patch, HIVE-2206.D11097.11.patch,
> HIVE-2206.D11097.12.patch, HIVE-2206.D11097.13.patch,
> HIVE-2206.D11097.14.patch, HIVE-2206.D11097.15.patch,
> HIVE-2206.D11097.16.patch, HIVE-2206.D11097.17.patch,
> HIVE-2206.D11097.18.patch, HIVE-2206.D11097.1.patch,
> HIVE-2206.D11097.2.patch, HIVE-2206.D11097.3.patch, HIVE-2206.D11097.4.patch,
> HIVE-2206.D11097.5.patch, HIVE-2206.D11097.6.patch, HIVE-2206.D11097.7.patch,
> HIVE-2206.D11097.8.patch, HIVE-2206.D11097.9.patch, testQueries.2.q,
> YSmartPatchForHive.patch
>
>
> This issue proposes a new logical optimizer called Correlation Optimizer,
> which is used to merge correlated MapReduce jobs (MR jobs) into a single MR
> job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The
> paper and slides of YSmart are linked at the bottom.
> Since Hive translates queries in a sentence by sentence fashion, for every
> operation which may need to shuffle the data (e.g. join and aggregation
> operations), Hive will generate a MapReduce job for that operation. However,
> for those operations which may need to shuffle the data, they may involve
> correlations explained below and thus can be executed in a single MR job.
> # Input Correlation: Multiple MR jobs have input correlation (IC) if their
> input relation sets are not disjoint;
> # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they
> have not only input correlation, but also the same partition key;
> # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its
> child nodes if it has the same partition key as that child node.
> The current implementation of correlation optimizer only detect correlations
> among MR jobs for reduce-side join operators and reduce-side aggregation
> operators (not map only aggregation). A query will be optimized if it
> satisfies following conditions.
> # There exists a MR job for reduce-side join operator or reduce side
> aggregation operator which have JFC with all of its parents MR jobs (TCs will
> be also exploited if JFC exists);
> # All input tables of those correlated MR job are original input tables (not
> intermediate tables generated by sub-queries); and
> # No self join is involved in those correlated MR jobs.
> Correlation optimizer is implemented as a logical optimizer. The main reasons
> are that it only needs to manipulate the query plan tree and it can leverage
> the existing component on generating MR jobs.
> Current implementation can serve as a framework for correlation related
> optimizations. I think that it is better than adding individual optimizers.
> There are several work that can be done in future to improve this optimizer.
> Here are three examples.
> # Support queries only involve TC;
> # Support queries in which input tables of correlated MR jobs involves
> intermediate tables; and
> # Optimize queries involving self join.
> References:
> Paper and presentation of YSmart.
> Paper:
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
> Slides: http://sdrv.ms/UpwJJc
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira