konstantinb commented on code in PR #6244:
URL: https://github.com/apache/hive/pull/6244#discussion_r2651876376
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -41,9 +41,15 @@ public void add(ColStatistics stat) {
if (stat.getAvgColLen() > result.getAvgColLen()) {
result.setAvgColLen(stat.getAvgColLen());
}
- if (stat.getCountDistint() > result.getCountDistint()) {
- result.setCountDistint(stat.getCountDistint());
- }
+
+ // NDVs can only be accurately combined if full information about
columns, query branches and
+ // their relationships is available. Without that info, there is only
one "truly conservative"
+ // value of NDV which is 0, which means that the NDV is unknown. It
forces optimizer
+ // to make the most conservative decisions possible, which is the exact
goal of
+ // PessimisticStatCombiner. It does inflate statistics in multiple
cases, but at the same time it
+ // also ensures than the query execution does not "blow up" due to too
optimistic stats estimates
+ result.setCountDistint(0L);
Review Comment:
This could appear counter-intuitive at first, however, when combining
statistics of different logical branches of the same column, and having no
reliable information about their interdependencies (i.e. in a "truly
pessimistic" scenario), every other option appears to introduce undesired
under-estimations, which often lead to catastrophic query failures.
For example, a simple column generated by an CASE..WHEN clause with three
constants produces an NDV of 1 by the original code, while, in most cases, the
"true" NDV is 3. If such a column participates in a GROUP BY condition later
on, its estimated number of records naturally becomes "1". Even this seemingly
small under-estimation could lead to bad decision of converting to a mapjoin or
not, especially over large data sets.
Alternatively, trying to "total up" NDV values of the same columns could
cause over-estimation of the true NDV of such a column, which, it its turn,
could lead to a severe underestimation of records matching an "IN" filter,
ultimately producing equally bad results as the previous case
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]