zabetak commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2675901598


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  /*
+   * Build the ReduceSink matching pattern used by TopNKey optimization.
+   *
+   * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+   * applying TopNKey results in a performance regression. ReduceSink
+   * operators created only for ordering must therefore be excluded from
+   * TopNKey.
+   *
+   * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+   * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+   * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+   * between.
+   */
+  private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+    boolean hasOrderOrLimit =
+            procCtx.parseContext.getQueryProperties().hasLimit() ||
+                    procCtx.parseContext.getQueryProperties().hasOrderBy();

Review Comment:
   ```sql
   select * 
   from ( select p_mfgr, rank() over(partition by p_mfgr order by p_name) r from part) a 
   where r < 4;
   ```
   Should such queries use the `Top N Key Operator`? 
   
   #### Plan A: With Top N Key Operator
   ```
           Map 1 
               Map Operator Tree:
                   TableScan
                     alias: part
                     Statistics: Num rows: 26 Data size: 5694 Basic stats: COMPLETE Column stats: COMPLETE
                     Top N Key Operator
                       sort order: ++
                       keys: p_mfgr (type: string), p_name (type: string)
                       null sort order: az
                       Map-reduce partition columns: p_mfgr (type: string)
                       Statistics: Num rows: 26 Data size: 5694 Basic stats: COMPLETE Column stats: COMPLETE
                       top n: 4
                       Reduce Output Operator
                         key expressions: p_mfgr (type: string), p_name (type: string)
                         null sort order: az
                         sort order: ++
                         Map-reduce partition columns: p_mfgr (type: string)
                         Statistics: Num rows: 26 Data size: 5694 Basic stats: COMPLETE Column stats: COMPLETE
   ```
   #### Plan B: Without Top N Key Operator
   ```
           Map 1 
               Map Operator Tree:
                   TableScan
                     alias: part
                     Statistics: Num rows: 26 Data size: 5694 Basic stats: COMPLETE Column stats: COMPLETE
                     Reduce Output Operator
                       key expressions: p_mfgr (type: string), p_name (type: string)
                       null sort order: az
                       sort order: ++
                       Map-reduce partition columns: p_mfgr (type: string)
                       Statistics: Num rows: 26 Data size: 5694 Basic stats: COMPLETE Column stats: COMPLETE
                       TopN Hash Memory Usage: 0.8
   ```
   The plan structure is almost identical to that of ORDER BY + LIMIT queries, so based on the discussion so far, I was under the impression that "Plan B" is the more efficient choice in most cases.
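
To make the question concrete: read literally, the rule in the Javadoc above would still allow TopNKey for the `rank()` query, provided its ReduceSink is classified as originating from a PTF shape. The following is only a minimal sketch of that rule as the comment describes it, not the code in this PR. The class, the `Origin` enum, and `mayUseTopNKey` are hypothetical names, not Hive APIs, and the branch for queries without ORDER BY or LIMIT is an assumption, since the quoted hunk does not show it.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Hive.
final class TopNKeyEligibilitySketch {

  // Where a ReduceSink "originates from", following the shapes listed in the Javadoc.
  enum Origin { ORDER_ONLY, GROUP_BY, JOIN, MAPJOIN, LATERAL_VIEW_JOIN, PTF }

  // Shapes for which the Javadoc keeps TopNKey enabled when ORDER BY or LIMIT is present.
  private static final Set<Origin> ALLOWED = EnumSet.of(
      Origin.GROUP_BY, Origin.JOIN, Origin.MAPJOIN, Origin.LATERAL_VIEW_JOIN, Origin.PTF);

  static boolean mayUseTopNKey(boolean hasOrderOrLimit, Origin origin) {
    if (!hasOrderOrLimit) {
      // Assumption: without ORDER BY / LIMIT no restriction is applied.
      return true;
    }
    return ALLOWED.contains(origin);
  }

  public static void main(String[] args) {
    // Plain ORDER BY + LIMIT: ReduceSink exists only for ordering, so TopNKey is excluded.
    System.out.println(mayUseTopNKey(true, Origin.ORDER_ONLY));
    // Windowing (PTF) shape: still eligible under the rule as written.
    System.out.println(mayUseTopNKey(true, Origin.PTF));
  }
}
```

If "Plan B" is indeed preferable for the windowing query, the PTF entry in the allowed set is the part of the rule that would need revisiting.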



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

