Re: [PR] HIVE-29322: Avoid TopNKeyOperator When ReduceSink TopNkey Filtering Provides Better Pruning for ORDER BY LIMIT Queries [hive]

via GitHub Mon, 19 Jan 2026 07:22:31 -0800


Indhumathi27 commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2705187020



##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  /*
+   * Build the ReduceSink matching pattern used by TopNKey optimization.
+   *
+   * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+   * applying TopNKey results in a performance regression. ReduceSink
+   * operators created only for ordering must therefore be excluded from
+   * TopNKey.
+   *
+   * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+   * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+   * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+   * between.
+   */
+  private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+    boolean hasOrderOrLimit =
+            procCtx.parseContext.getQueryProperties().hasLimit() ||
+                    procCtx.parseContext.getQueryProperties().hasOrderBy();

Review Comment:
   > PS. The content of the [ptf_testcase.txt](https://github.com/user-attachments/files/24545845/ptf_testcase.txt) file is identical with the previous run. The screenshots show inputs with 51M rows, but the content of the attachment is not aligned.
   
   I used the schema from [ptf_testcase.txt](https://github.com/user-attachments/files/24545845/ptf_testcase.txt) and generated more data for testing. I ran an INSERT OVERWRITE ... SELECT from the same table.
   
   > I don't understand what exactly we refer to by saying global shuffle. I would also like some more clarifications about the "reducer fan-in".
   
   Since PTF queries have a partitioning key, each reducer only receives rows for the partitions it owns. In simple ORDER BY ... LIMIT queries, all data has to go to a single reducer.
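
As a rough illustration of that difference, here is a minimal sketch of how rows would be spread across reducers in the two cases. It is a simplified model under stated assumptions, not Hive's or Tez's actual partitioner: the class `ReducerFanInSketch`, the method `rowsPerReducer`, and the plain hash-modulo routing are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified model of reducer fan-in; not Hive/Tez code.
final class ReducerFanInSketch {

    // Counts how many rows each reducer would receive.
    // hasPartitionKey = true  models a PTF-style shape (rows routed by partition key).
    // hasPartitionKey = false models a plain ORDER BY ... LIMIT (everything to one reducer).
    static int[] rowsPerReducer(List<String> partitionKeys, int numReducers, boolean hasPartitionKey) {
        int[] counts = new int[numReducers];
        for (String key : partitionKeys) {
            // Assumption: hash-modulo routing stands in for the real partitioner.
            int reducer = hasPartitionKey ? Math.floorMod(key.hashCode(), numReducers) : 0;
            counts[reducer]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "c", "d", "a", "b");
        System.out.println("PTF-style:        " + Arrays.toString(rowsPerReducer(keys, 4, true)));
        System.out.println("ORDER BY + LIMIT: " + Arrays.toString(rowsPerReducer(keys, 4, false)));
    }
}
```

In this model the partitioned case spreads the six rows over all four reducers, while the unpartitioned case sends all six rows to reducer 0.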
   
   > Since the number of input records to Reducer 2 differs when TopNKey is enabled and disabled, what exactly do you mean that the fan-in is not affected?
   
   The comparison is between PTF queries and simple ORDER BY ... LIMIT queries, not between PTF queries with TopNKey enabled and disabled.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


