pgaref commented on a change in pull request #952: HIVE-23006 ProbeDecode
compiler support
URL: https://github.com/apache/hive/pull/952#discussion_r406155406
##########
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java
##########
@@ -1482,18 +1490,131 @@ private void removeSemijoinsParallelToMapJoin(OptimizeTezProcContext procCtx)
         deque.addAll(op.getChildOperators());
       }
     }
+    // No need to remove SJ branches when we have semi-join reduction or when semijoins are enabled for parallel mapjoins.
+    if (!procCtx.conf.getBoolVar(ConfVars.TEZ_DYNAMIC_SEMIJOIN_REDUCTION_FOR_MAPJOIN)) {
+      if (semijoins.size() > 0) {
+        for (Entry<ReduceSinkOperator, TableScanOperator> semiEntry : semijoins.entrySet()) {
+          SemiJoinBranchInfo sjInfo = procCtx.parseContext.getRsToSemiJoinBranchInfo().get(semiEntry.getKey());
+          if (sjInfo.getIsHint() || !sjInfo.getShouldRemove()) {
+            // Created by hint, skip it
+            continue;
+          }
+          if (LOG.isDebugEnabled()) {
+            LOG.debug("Semijoin optimization with parallel edge to map join. Removing semijoin " +
+                OperatorUtils.getOpNamePretty(semiEntry.getKey()) + " - " + OperatorUtils.getOpNamePretty(semiEntry.getValue()));
+          }
+          GenTezUtils.removeBranch(semiEntry.getKey());
+          GenTezUtils.removeSemiJoinOperator(procCtx.parseContext, semiEntry.getKey(), semiEntry.getValue());
+        }
+      }
+    }
+    if (procCtx.conf.getBoolVar(ConfVars.HIVE_OPTIMIZE_SCAN_PROBEDECODE)) {
+      if (probeDecodeMJoins.size() > 0) {
Review comment:
When using the MJ to add filtering on ProbeDecode (rather than static
filters), I believe we should always keep the context, as we don't really know
how effective the filter is going to be, right? (That is, we don't know how
many HT keys are going to match the particular column on the TS side.)
In ORC-597 I did some experiments using existing datasets (github, sales,
etc.) and found that even when we filter out 20% of elements we don't add
extra overhead (due to row-level filtering); of course, this depends on the
data type as well.
To tackle the above, I have a runtime optimization as part of the ORC Data
Consumer (part of LLAP) that disables the filter when it's not effective:
https://github.com/apache/hive/pull/926/files#diff-4137d272789978e35fd5f489f09da064R343
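To make the idea concrete, here is a minimal sketch of a filter that switches itself off when it stops paying for itself. It is not the code from the PR linked above: the class name `AdaptiveProbeFilter`, the `minFilteredRatio` threshold, and the `sampleSize` window are all hypothetical, and keys are simplified to `long` values.

```java
import java.util.Set;

/**
 * Hypothetical sketch of a self-disabling probe-side filter.
 * Not the actual ORC/LLAP implementation; all names are illustrative.
 */
final class AdaptiveProbeFilter {
    private final Set<Long> hashTableKeys;  // keys of the map-join hash table (assumed long)
    private final double minFilteredRatio;  // below this ratio the filter is considered ineffective
    private final long sampleSize;          // rows observed before each effectiveness check
    private long seen;
    private long filtered;
    private boolean enabled = true;

    AdaptiveProbeFilter(Set<Long> hashTableKeys, double minFilteredRatio, long sampleSize) {
        this.hashTableKeys = hashTableKeys;
        this.minFilteredRatio = minFilteredRatio;
        this.sampleSize = sampleSize;
    }

    /** Returns true if the row should be passed downstream. */
    boolean keep(long key) {
        if (!enabled) {
            // Filter is off: pass every row through and let the join drop non-matches.
            return true;
        }
        boolean match = hashTableKeys.contains(key);
        seen++;
        if (!match) {
            filtered++;
        }
        if (seen == sampleSize) {
            // End of a sampling window: disable the filter if it removed too few rows.
            double ratio = (double) filtered / seen;
            if (ratio < minFilteredRatio) {
                enabled = false;
            }
            seen = 0;
            filtered = 0;
        }
        return match;
    }

    boolean isEnabled() {
        return enabled;
    }
}
```

With `minFilteredRatio` set to, say, 0.2, a window in which fewer than 20% of rows are dropped turns the filter off for the rest of the scan. Correctness is unaffected in this sketch because a disabled filter only lets more rows reach the join, which still discards the non-matching ones.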