jcamachor commented on a change in pull request #952: HIVE-23006 ProbeDecode compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r405765078
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java
##########
@@ -1482,18 +1490,131 @@ private void removeSemijoinsParallelToMapJoin(OptimizeTezProcContext procCtx)
         deque.addAll(op.getChildOperators());
       }
     }
+    // No need to remove SJ branches when we have semi-join reduction or when semijoins are enabled for parallel mapjoins.
+    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN)) {
+      if (semijoins.size() > 0) {
+        for (Entry<ReduceSinkOperator, TableScanOperator> semiEntry : semijoins.entrySet()) {
+          SemiJoinBranchInfo sjInfo = procCtx.parseContext.getRsToSemiJoinBranchInfo().get(semiEntry.getKey());
+          if (sjInfo.getIsHint() || !sjInfo.getShouldRemove()) {
+            // Created by hint, skip it
+            continue;
+          }
+          if (LOG.isDebugEnabled()) {
+            LOG.debug("Semijoin optimization with parallel edge to map join. Removing semijoin " +
+                OperatorUtils.getOpNamePretty(semiEntry.getKey()) + " - " + OperatorUtils.getOpNamePretty(semiEntry.getValue()));
+          }
+          GenTezUtils.removeBranch(semiEntry.getKey());
+          GenTezUtils.removeSemiJoinOperator(procCtx.parseContext, semiEntry.getKey(), semiEntry.getValue());
+        }
+      }
+    }
+    if (procCtx.conf.getBoolVar(ConfVars.HIVE_OPTIMIZE_SCAN_PROBEDECODE)) {
+      if (probeDecodeMJoins.size() > 0) {
Review comment:
The path for `HIVE_OPTIMIZE_SCAN_PROBEDECODE` seems independent of the SJ
optimization. Should we add a mechanism to remove the context for the
optimization when we think it is not going to be beneficial, e.g., when it is
not filtering any data? Or do you think that the cost of checking is negligible
and we should always apply this optimization? What do your experiments show in
the worst-case scenario? (In any case, this could be tackled in a follow-up, but
I wanted to ask.)
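
To make the question concrete, a guard of the kind asked about might look
roughly like the sketch below. Everything in it is hypothetical and not part of
this PR or of Hive: the class name `ProbeDecodeGuard`, the method
`shouldKeepProbeDecode`, the row-count inputs, and the threshold parameter are
all assumptions for illustration. It only shows the idea of dropping the
probe-decode context when the estimated fraction of filtered rows is too small.

```java
// Hypothetical sketch only: not an existing Hive class or API.
final class ProbeDecodeGuard {
  private ProbeDecodeGuard() {}

  /**
   * Decides whether the probe-decode context is worth keeping.
   *
   * @param probeRows             estimated rows scanned on the probe side (assumed input)
   * @param estimatedMatchingRows estimated rows that survive the probe (assumed input)
   * @param minFilteredFraction   assumed threshold, e.g. 0.1 means "filters at least 10%"
   * @return true if the estimated filtered fraction reaches the threshold
   */
  static boolean shouldKeepProbeDecode(long probeRows, long estimatedMatchingRows,
      double minFilteredFraction) {
    if (probeRows <= 0) {
      // No usable statistics: conservatively report no benefit.
      return false;
    }
    // Clamp the estimate into [0, probeRows] so bad statistics cannot
    // produce a negative or greater-than-one fraction.
    long matching = Math.max(0L, Math.min(estimatedMatchingRows, probeRows));
    double filteredFraction = 1.0 - ((double) matching / probeRows);
    return filteredFraction >= minFilteredFraction;
  }
}
```

With such a check, a map join whose probe is estimated to filter nothing would
have its context removed, while the worst-case cost question above would still
need to be answered by measurement.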