zabetak commented on code in PR #6202:
URL: https://github.com/apache/hive/pull/6202#discussion_r2760082186
##########
ql/src/java/org/apache/hadoop/hive/ql/parse/TezCompiler.java:
##########
@@ -1322,6 +1325,54 @@ private static void runTopNKeyOptimization(OptimizeTezProcContext procCtx)
ogw.startWalking(topNodes, null);
}
+ /*
+ * Build the ReduceSink matching pattern used by TopNKey optimization.
+ *
+ * For ORDER BY / LIMIT queries that do not involve GROUP BY or JOIN,
+ * applying TopNKey results in a performance regression. ReduceSink
+ * operators created only for ordering must therefore be excluded from
+ * TopNKey.
+ *
+ * When ORDER BY or LIMIT is present, restrict TopNKey to ReduceSink
+ * operators that originate from GROUP BY, JOIN, MAPJOIN, LATERAL VIEW
+ * JOIN or PTF query shapes. SELECT and FILTER operators may appear in
+ * between.
+ */
+ private static String buildTopNKeyRegexPattern(OptimizeTezProcContext procCtx) {
+ String reduceSinkOp = ReduceSinkOperator.getOperatorName() + "%";
+
+ boolean hasOrderOrLimit =
+     procCtx.parseContext.getQueryProperties().hasLimit() ||
+     procCtx.parseContext.getQueryProperties().hasOrderBy();
Review Comment:
Many thanks for the additional results @Indhumathi27; they offer valuable
insights that can definitely help us push this work forward.

Looking into the results, there is one particular case (i.e., "Sorted DESC")
where TopNKey ON leads to rather bad pruning & performance, but I am not sure
we should base the decision to disable the optimization on this sole scenario.
Another interesting case is "Duplicate + Random": despite the fact that we are
sending 2.4M records from the mapper to the reducer when TopNKey is ON, the
"Map Time" is lower than when TopNKey is OFF and the overall response time of
the query is identical. It is worth mentioning that the "ineffective" pruning
on the mapper (2.4M vs 700) did not significantly affect the Map/Reduce time,
which makes me skeptical about shifting the focus to data pruning.
Overall, I feel that we should run more experiments covering some additional
distributions in order to make a more informed decision about this
optimization. Below I outline a few suggestions on the experimental setup for
making the analysis more systematic and potentially helping us extract better
conclusions.
First of all, I would suggest creating a dedicated GitHub repository (e.g.,
https://github.com/Indhumathi27/HIVE-29322) for holding all the details around
this set of experiments, including setup scripts, raw results, runtime
counters, and analysis results. Apart from the code changes, you have put a
tremendous effort into these benchmarks, so it would be nice to conclude the
work with a blog post/article that summarizes our findings in this area. If
you are interested, I would be more than happy to help write it.
Starting with the dataset, there are two important factors that we should
consider: i) distribution and ii) layout.

For the distribution, we should test a Zipfian distribution and a few uniform
ordinal distributions. For the uniform distributions, we can pick 3 variants
that vary the pruning efficiency (best, worst, borderline). To pick the domain
size (D) for the uniform distributions we can use the following formula:
```
D = N / (ρK)
```
where N is the total number of rows, K the limit, and ρ the pruning
difficulty. I think for this set of experiments we can keep N and K stable and
vary `ρ ∈ {0.01, 1, 10}`. So with N=10M and K=100, the domain sizes for the
three uniform distributions are 10M, 100K, and 10K, respectively. In the
latest run, you also covered a few more degenerate cases like `ρ=100` and
`ρ=100K`, where D becomes very small (approaching a constant column); we can
still keep those, but they are less important in terms of realistic use cases.
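
To make the arithmetic easy to reproduce when we change N, K, or ρ, here is a
minimal sketch (plain Java; the class and variable names are just
illustrative) that computes D for the proposed ρ values:

```java
public class DomainSize {
  // D = N / (rho * K): domain size for a uniform distribution, given
  // total rows N, limit K, and pruning difficulty rho.
  static long domainSize(long n, long k, double rho) {
    return Math.round(n / (rho * k));
  }

  public static void main(String[] args) {
    long n = 10_000_000L; // N = 10M rows
    long k = 100L;        // K = LIMIT 100
    for (double rho : new double[] {0.01, 1, 10}) {
      System.out.printf("rho=%s -> D=%d%n", rho, domainSize(n, k, rho));
    }
    // Prints D = 10000000, 100000, 10000 respectively.
  }
}
```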
Having a Zipfian distribution is a must, since real data are usually skewed
and many real use cases fall into this category.
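
For generating the skewed column, a small generator along the following lines
could live in the experiments repository. This is only a sketch assuming
Apache Commons Math 3 is on the classpath; the domain size, exponent, row
count, and file name are illustrative choices:

```java
import java.io.PrintWriter;
import org.apache.commons.math3.distribution.ZipfDistribution;

public class ZipfGen {
  public static void main(String[] args) throws Exception {
    // Zipfian scoring column: 100K distinct values, skew exponent 1.1.
    ZipfDistribution zipf = new ZipfDistribution(100_000, 1.1);
    try (PrintWriter out = new PrintWriter("zipf_h.csv")) {
      for (long i = 0; i < 10_000_000L; i++) { // N = 10M rows
        out.println(zipf.sample());
      }
    }
  }
}
```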
For the layout, we basically have two options: random and sorted. For the
sorted layout we don't really need both ASC and DESC, since we can take the
direction into account in the query itself:
```sql
Q1: select * from seg1_orc order by h asc limit 100;
Q2: select * from seg1_orc order by h desc limit 100;
```
Having multiple queries with different sort directions is easier to follow
than having multiple datasets.
For the simple ORDER BY experiments it is rather safe to focus exclusively on
the distribution of the scoring column (i.e., `h`). However, for the window
queries (PTF) the partitioning column (`a`) is important as well:
```sql
SELECT a, ranking
FROM (
  SELECT
    a,
    h,
    RANK() OVER (PARTITION BY a ORDER BY h) AS ranking
  FROM seg1_orc
) tmp
WHERE ranking <= 3;
```
The width of the table (number of columns) and how many of those are returned
to the user do not affect the pruning behavior, but they definitely have an
impact on the processing/response time of each operator. Since the ORDER BY
experiments return all columns, for consistency we should do the same for the
window query.
Finally, there are various caching layers that can have an impact on operator
execution time, so I would suggest running each query multiple times (at least
3) before reporting the results. Apart from the aggregated results, please
keep the raw results somewhere so that we can inspect them later if needed.
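
As a sketch of what such a harness could look like (the HiveServer2 JDBC URL
and the query are placeholders for the actual setup; the per-run timings would
be kept as the raw results):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RepeatRunner {
  public static void main(String[] args) throws Exception {
    String url = "jdbc:hive2://localhost:10000/default"; // placeholder endpoint
    String query = "select * from seg1_orc order by h asc limit 100";
    try (Connection conn = DriverManager.getConnection(url)) {
      for (int run = 1; run <= 3; run++) { // at least 3 runs per query
        long start = System.nanoTime();
        try (Statement st = conn.createStatement()) {
          st.execute(query);
        }
        // Record the raw wall-clock time of each run.
        System.out.printf("run %d: %d ms%n",
            run, (System.nanoTime() - start) / 1_000_000);
      }
    }
  }
}
```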