[ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13548809#comment-13548809 ]
Liu Zongquan commented on HIVE-2206: ------------------------------------ [~yhuai] Thanks so much! > add a new optimizer for query correlation discovery and optimization > -------------------------------------------------------------------- > > Key: HIVE-2206 > URL: https://issues.apache.org/jira/browse/HIVE-2206 > Project: Hive > Issue Type: New Feature > Components: Query Processor > Affects Versions: 0.10.0 > Reporter: He Yongqiang > Assignee: Yin Huai > Attachments: HIVE-2206.10-r1384442.patch.txt, > HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, > HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, > HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, > HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, > HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, > HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, > HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, > HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, > HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch > > > This issue proposes a new logical optimizer called Correlation Optimizer, > which is used to merge correlated MapReduce jobs (MR jobs) into a single MR > job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/).The > paper and slides of YSmart are linked at the bottom. > Since Hive translates queries in a sentence by sentence fashion, for every > operation which may need to shuffle the data (e.g. join and aggregation > operations), Hive will generate a MapReduce job for that operation. However, > for those operations which may need to shuffle the data, they may involve > correlations explained below and thus can be executed in a single MR job. > # Input Correlation: Multiple MR jobs have input correlation (IC) if their > input relation sets are not disjoint; > # Transit Correlation: Multiple MR jobs have transit correlation (TC) if they > have not only input correlation, but also the same partition key; > # Job Flow Correlation: An MR has job flow correlation (JFC) with one of its > child nodes if it has the same partition key as that child node. > The current implementation of correlation optimizer only detect correlations > among MR jobs for reduce-side join operators and reduce-side aggregation > operators (not map only aggregation). A query will be optimized if it > satisfies following conditions. > # There exists a MR job for reduce-side join operator or reduce side > aggregation operator which have JFC with all of its parents MR jobs (TCs will > be also exploited if JFC exists); > # All input tables of those correlated MR job are original input tables (not > intermediate tables generated by sub-queries); and > # No self join is involved in those correlated MR jobs. > Correlation optimizer is implemented as a logical optimizer. The main reasons > are that it only needs to manipulate the query plan tree and it can leverage > the existing component on generating MR jobs. > Current implementation can serve as a framework for correlation related > optimizations. I think that it is better than adding individual optimizers. > There are several work that can be done in future to improve this optimizer. > Here are three examples. > # Support queries only involve TC; > # Support queries in which input tables of correlated MR jobs involves > intermediate tables; and > # Optimize queries involving self join. > References: > Paper and presentation of YSmart. > Paper: > http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf > Slides: http://sdrv.ms/UpwJJc -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira