Indhumathi27 commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2729107754
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
ogw.startWalking(topNodes, null);
}
+ /*
+ * Build the ReduceSink matching pattern used by TopNKey optimization.
+ *
+ * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+ * applying TopNKey results in a performance regression. ReduceSink
+ * operators created only for ordering must therefore be excluded from
+ * TopNKey.
+ *
+ * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+ * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+ * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+ * between.
+ */
+ private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+ String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+ boolean hasOrderOrLimit =
+ procCtx.parseContext.getQueryProperties().hasLimit() ||
+ procCtx.parseContext.getQueryProperties().hasOrderBy();
Review Comment:
Thanks for the feedback; it helped clarify the gaps in the earlier
comparison.
My previous runs used different datasets for ORDER BY and PTF, so I re-ran
all tests on the same 16M-row dataset. With this setup, PTF also shows a clear
regression with TopNKey enabled for DESC-sorted data:
PTF Sorted-DESC
TopNKey ON → Map output = 16M, Time = 26.6s
TopNKey OFF → Map output = 4, Time = 13.6s
In this case, TopNKey cannot prune on the map side, which in turn blocks
ReduceSink pruning, so all 16M rows are shuffled by partition and the run takes
roughly twice as long (26.6s vs. 13.6s). This matches the observation that
performance depends on how much the mapper can reduce its output before the
shuffle: for ORDER BY the reduction is very large; for PTF it is smaller in
many cases; and for DESC-sorted input it drops to ~1× (no pruning), which
explains the visible slowdown.
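To illustrate why input order matters so much here, below is a minimal sketch of a top-N key filter. It is a simplified model written for this comment, not Hive's actual TopNKey implementation: the class and method names are hypothetical, keys are plain integers, and it assumes the query orders ascending while the rows arrive in descending order.

```java
import java.util.TreeSet;

// Simplified, hypothetical model of a map-side top-N key filter.
// It is NOT Hive's TopNKey operator; it only shows how input order
// changes the number of rows that get forwarded.
class TopNKeyFilterSketch {

    // Returns how many rows pass the filter when we keep the n smallest
    // distinct keys seen so far (ascending ORDER BY ... LIMIT n).
    static int countForwarded(int[] keys, int n) {
        if (n <= 0) {
            return 0;
        }
        TreeSet<Integer> topKeys = new TreeSet<>();
        int forwarded = 0;
        for (int key : keys) {
            if (topKeys.size() < n) {
                // Still filling the top-N set: every row is forwarded.
                topKeys.add(key);
                forwarded++;
            } else if (key <= topKeys.last()) {
                // Key is within the current top N: forward it and, if it is
                // a new key, evict the largest key to keep the set at size n.
                if (topKeys.add(key)) {
                    topKeys.pollLast();
                }
                forwarded++;
            }
            // Otherwise the key is worse than the current top N and is dropped.
        }
        return forwarded;
    }

    public static void main(String[] args) {
        int rows = 1000;
        int[] ascending = new int[rows];
        int[] descending = new int[rows];
        for (int i = 0; i < rows; i++) {
            ascending[i] = i + 1;
            descending[i] = rows - i;
        }
        System.out.println("ascending input forwarded:  " + countForwarded(ascending, 4));
        System.out.println("descending input forwarded: " + countForwarded(descending, 4));
    }
}
```

With a limit of 4, the ascending input forwards only 4 of the 1000 rows, while the descending input forwards all 1000, because every new key beats the current top 4. Under these assumptions that is the same ~1× reduction described above.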
I’ve attached the full 16M-row reports for both ORDER BY and PTF.
[TopNkeyTest.txt](https://github.com/user-attachments/files/24867674/TopNkeyTest.txt)
[TopNkeyTest_ptf.txt](https://github.com/user-attachments/files/24867678/TopNkeyTest_ptf.txt)
I’d appreciate your thoughts on how you’d like to proceed based on these
results. Happy to run more experiments or adjust the patch based on your
recommendation.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]