zabetak commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2711328802


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  /*
+   * Build the ReduceSink matching pattern used by TopNKey optimization.
+   *
+   * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+   * applying TopNKey results in a performance regression. ReduceSink
+   * operators created only for ordering must therefore be excluded from
+   * TopNKey.
+   *
+   * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+   * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+   * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+   * between.
+   */
+  private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+    boolean hasOrderOrLimit =
+            procCtx.parseContext.getQueryProperties().hasLimit() ||
+                    procCtx.parseContext.getQueryProperties().hasOrderBy();

Review Comment:
   The description and analysis under HIVE-29322 imply that the performance 
difference with/without `TopNKeyOperator` is directly related to the number of 
records pruned by the mapper and processed by the reducer. It's not clear why 
the number of records matters for simple ORDER BY queries and does not matter 
for simple windowing queries.
   
   In the runs with ORDER BY the 16M INPUT_RECORDS were pruned down to 100 
OUTPUT_RECORDS, a reduction by a factor of roughly **100,000**.
   In the runs with window functions the 51M INPUT_RECORDS were pruned down to 6M 
OUTPUT_RECORDS, a reduction by a factor of only about **10**.
   
   I suspect that there is no visible difference in performance because the 
reduction factor is still very different between the experiments.
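
   For reference, the reduction factors can be recomputed from the rounded 
counter values quoted above. This is only a rough sketch: the class and method 
names are hypothetical and not part of Hive, and the inputs are the approximate 
figures from this comment (16M, 100, 51M, 6M), not exact counter readings.

```java
public class ReductionFactor {

    // Ratio of records entering the operator to records leaving it.
    // Hypothetical helper for illustration only; not a Hive API.
    static double reductionFactor(long inputRecords, long outputRecords) {
        if (outputRecords <= 0) {
            throw new IllegalArgumentException("outputRecords must be positive");
        }
        return (double) inputRecords / outputRecords;
    }

    public static void main(String[] args) {
        // Rounded figures taken from the comment above (assumed, not exact).
        double orderBy = reductionFactor(16_000_000L, 100L);
        double windowing = reductionFactor(51_000_000L, 6_000_000L);

        System.out.printf("ORDER BY reduction factor:  %.1f%n", orderBy);
        System.out.printf("Windowing reduction factor: %.1f%n", windowing);
    }
}
```

   With these rounded inputs the two factors differ by about four orders of 
magnitude, so the two experiments are not exercising comparable pruning rates.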



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

