Re: [PR] HIVE-29322: Avoid TopNKeyOperator When Map-Side LIMIT Pushdown Provides Better Pruning for ORDER BY LIMIT Queries [hive]

via GitHub Wed, 31 Dec 2025 07:46:58 -0800


zabetak commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2655505625



##########
ql/src/test/results/clientpositive/llap/lateral_view.q.out:
##########
@@ -100,22 +90,17 @@ STAGE PLANS:
               Limit
                 Number of rows: 1
                 Statistics: Num rows: 1 Data size: 178 Basic stats: COMPLETE 
Column stats: COMPLETE
-                Top N Key Operator
-                  sort order: ++
-                  keys: KEY.reducesinkkey0 (type: string), KEY.reducesinkkey1 
(type: int)
-                  null sort order: zz
+                Select Operator
+                  expressions: KEY.reducesinkkey0 (type: string), VALUE._col0 
(type: string), KEY.reducesinkkey1 (type: int)
+                  outputColumnNames: _col0, _col1, _col2
                   Statistics: Num rows: 1 Data size: 178 Basic stats: COMPLETE 
Column stats: COMPLETE
-                  top n: 1
-                  Select Operator
-                    expressions: KEY.reducesinkkey0 (type: string), 
VALUE._col0 (type: string), KEY.reducesinkkey1 (type: int)
-                    outputColumnNames: _col0, _col1, _col2
+                  Reduce Output Operator
+                    key expressions: _col0 (type: string), _col2 (type: int)
+                    null sort order: zz
+                    sort order: ++
                     Statistics: Num rows: 1 Data size: 178 Basic stats: 
COMPLETE Column stats: COMPLETE
-                    Reduce Output Operator
-                      key expressions: _col0 (type: string), _col2 (type: int)
-                      null sort order: zz
-                      sort order: ++
-                      Statistics: Num rows: 1 Data size: 178 Basic stats: 
COMPLETE Column stats: COMPLETE
-                      value expressions: _col1 (type: string)
+                    TopN Hash Memory Usage: 0.1

Review Comment:
   When the plan contains a `Limit` operator all subsequent TopN optimizations 
in the same mapper/reducer seem redundant. It's out of scope of the current PR 
but it may be worth logging a separate JIRA ticket as a potential improvement.



##########
ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java:
##########
@@ -141,6 +141,9 @@ private ReducerTraits(int trait) {
   // used to decide whether global order is needed
   private transient boolean hasOrderBy = false;
 
+  // used to decide whether topn key optimisation can be applied
+  private transient boolean hasOnlyOrderByLimit = false;

Review Comment:
   Do we really need this new indicator here? IT seems to be a hint about the 
structure of the plan rather than a property/quality of the `ReduceSink` 
operator.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/topnkey/TopNKeyProcessor.java:
##########
@@ -71,6 +71,13 @@ public Object process(Node nd, Stack<Node> stack, 
NodeProcessorCtx procCtx,
       return null;
     }
 
+    // Skip the current optimization when a simple global ORDER BY...LIMIT is 
present
+    // (topN > -1 and hasOnlyOrderByLimit()).
+    // This plan structure is handled more efficiently by the specialized 
'TopN In Reducer' optimization.
+    if (reduceSinkDesc.getTopN() > -1 && reduceSinkDesc.hasOnlyOrderByLimit()) 
{

Review Comment:
   It seems that we don't want to introduce a `TopNKeyOperator` if the path 
between TS and RS does not contain a JOIN or GBY operator. We can add such 
checks at some point here or maybe better modify the respective `RuleRegExp` to 
only trigger when there is a GBY or JOIN in the path.



##########
ql/src/java/org/apache/hadoop/hive/ql/optimizer/LimitPushdownOptimizer.java:
##########
@@ -117,18 +118,29 @@ private static class TopNReducer implements 
SemanticNodeProcessor {
     @Override
     public Object process(Node nd, Stack<Node> stack,
                           NodeProcessorCtx procCtx, Object... nodeOutputs) 
throws SemanticException {
+      boolean hasOnlyOrderByLimit = true;
       ReduceSinkOperator rs = null;
       for (int i = stack.size() - 2 ; i >= 0; i--) {
         Operator<?> operator = (Operator<?>) stack.get(i);
-        if (operator.getNumChild() != 1) {
-          return false; // multi-GBY single-RS (TODO)
-        }
-        if (operator instanceof ReduceSinkOperator) {
-          rs = (ReduceSinkOperator) operator;
-          break;
+
+        if (operator instanceof GroupByOperator || operator instanceof 
JoinOperator) {
+          hasOnlyOrderByLimit = false;

Review Comment:
   The decision to introduce or not a `TopNKeyOperator` is a responsibility of 
the the `TopNKeyProcessor` so if we need to make decisions based on the 
structure of the plan it makes more sense to do put the GBY/JOIN checks there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HIVE-29322: Avoid TopNKeyOperator When Map-Side LIMIT Pushdown Provides Better Pruning for ORDER BY LIMIT Queries [hive]

Reply via email to