zabetak commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2760082186


##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
     ogw.startWalking(topNodes, null);
   }
 
+  /*
+   * Build the ReduceSink matching pattern used by TopNKey optimization.
+   *
+   * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+   * applying TopNKey results in a performance regression. ReduceSink
+   * operators created only for ordering must therefore be excluded from
+   * TopNKey.
+   *
+   * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+   * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+   * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+   * between.
+   */
+  private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+    String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+    boolean hasOrderOrLimit =
+            procCtx.parseContext.getQueryProperties().hasLimit() ||
+                    procCtx.parseContext.getQueryProperties().hasOrderBy();

Review Comment:
   Many thanks for the additional results @Indhumathi27; they offer valuable insights that can definitely help us push this work forward.
   
   Looking into the results, there is one particular case (i.e., "Sorted DESC") where TopNKey ON leads to rather bad pruning and performance, but I am not sure we should base the decision to disable the optimization on this scenario alone. Another interesting case is "Duplicate + Random": despite sending 2.4M records from the mapper to the reducer when TopNKey is ON, the "Map Time" is lower than when TopNKey is OFF, and the overall response time of the query is identical. It is worth mentioning that the "ineffective" pruning on the mapper (2.4M vs 700) did not significantly affect the Map/Reduce time, which makes me skeptical of shifting the focus to data pruning.
   
   Overall, I feel we should run more experiments covering some additional distributions in order to make a more informed decision regarding this optimization. Below I outline a few suggestions on the experimental setup that should make the analysis more systematic and potentially help us extract better conclusions.
   
   First of all, I would suggest creating a dedicated GitHub repository (e.g., https://github.com/Indhumathi27/HIVE-29322) to hold all the details around this set of experiments, including setup scripts, raw results, runtime counters, and analysis results. Apart from the code changes, you have put tremendous effort into these benchmarks, so it would be nice to conclude the work with a blog post/article that summarizes our findings in this area. If you are interested, I would be more than happy to help write it.
   
   Starting with the dataset, there are two important factors that we should consider: i) distribution and ii) layout.
   
   For the distribution, we should test a Zipfian distribution and a few uniform ordinal distributions. For the uniform distributions, we can pick three variants that vary the pruning efficiency (best, worst, borderline). To pick the domain size (D) for the uniform distributions, we can use the following formula:
   ```
   D = N / (ρK)
   ```
   where N is the total number of rows, K the limit, and ρ the pruning difficulty. For this set of experiments, I think we can keep N and K fixed and vary `ρ ∈ {0.01, 1, 10}`; with N=10M and K=100, the domain sizes for the three uniform distributions come out to 10M, 100K, and 10K. In the latest run you also covered a few more degenerate cases, like `ρ=100` and `ρ=100K`, where D is essentially 1 (constant); we can keep these, but they are less important in terms of realistic use cases.
   Having a Zipfian distribution is a must, since real data is usually skewed and many real use cases fall into this category.
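   
   To make this part of the setup concrete, here is a minimal sketch of how one of the uniform variants could be generated (purely illustrative: the table name, the `rand()` seed, and the `nums` row source providing the N=10M rows are assumptions, not something from the PR):
   ```sql
   -- Sketch: uniform column h with domain size D = N / (ρK).
   -- With N = 10,000,000 and K = 100:
   --   ρ = 0.01 -> D = 10,000,000 (near-unique keys, easiest pruning)
   --   ρ = 1    -> D = 100,000    (borderline)
   --   ρ = 10   -> D = 10,000     (heavy duplication, hardest pruning)
   CREATE TABLE seg_uniform_100k STORED AS ORC AS
   SELECT CAST(FLOOR(rand(42) * 100000) AS INT) AS h
   FROM nums;  -- nums: any helper source producing 10M rows
   ```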
   
   For the layout, we basically have two options: random and sorted. For the sorted layout we don't really need both ASC and DESC datasets, since we can take the direction into account in the query itself.
   ```sql
   Q1: select * from seg1_orc order by h asc limit 100;
   Q2: select * from seg1_orc order by h desc limit 100;
   ```
   Having multiple queries with different sort directions is easier to follow than having multiple datasets.
   
   For the simple ORDER BY experiments it is rather safe to focus exclusively on the distribution of the scoring column (i.e., `h`). However, for the window queries (PTF) the partitioning column (`a`) is important as well:
   
   ```sql
   SELECT a,ranking
   FROM (
     SELECT
       a,
       h,
       RANK() OVER (PARTITION BY a ORDER BY h) AS ranking
     FROM seg1_orc
   ) tmp
   WHERE ranking <= 3;
   ``` 
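   
   If useful, the same generator idea could be extended to control the cardinality of `a` independently of `h` (again just a sketch; the table name, seeds, and the `nums` source are the same hypothetical names as in the earlier snippet):
   ```sql
   -- Sketch: P controls how many window partitions RANK() operates over.
   CREATE TABLE seg_ptf_orc STORED AS ORC AS
   SELECT CAST(FLOOR(rand(7)  * 1000)   AS INT) AS a,  -- P = 1000 partitions
          CAST(FLOOR(rand(13) * 100000) AS INT) AS h   -- D = 100K as before
   FROM nums;
   ```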
   The width of the table (number of columns) and how many of those are returned to the user do not affect the pruning behavior, but they definitely have an impact on the processing/response time of each operator. Since in the ORDER BY experiments we return all columns, we should do the same for the window query to keep things consistent.
   
   Finally, there are various caching layers that can have an impact on operator execution time, so I would suggest running each query multiple times (at least 3) and reporting aggregated results. Apart from the aggregated results, please keep the raw results somewhere so that we can inspect them later on if needed.
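   
   As one concrete example of such a layer (assuming Hive 3+; this is the standard query results cache property, not something introduced by this PR), it is probably worth disabling the results cache explicitly so that repeated runs actually exercise the operators:
   ```sql
   -- Disable the query results cache so runs 2 and 3 are not short-circuited.
   SET hive.query.results.cache.enabled=false;
   select * from seg1_orc order by h asc limit 100; -- run 1 of 3
   ```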



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
