konstantinb commented on code in PR #6359:
URL: https://github.com/apache/hive/pull/6359#discussion_r2988884907
##########
ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java:
##########
@@ -41,9 +42,14 @@ public void add(ColStatistics stat) {
if (stat.getAvgColLen() > result.getAvgColLen()) {
result.setAvgColLen(stat.getAvgColLen());
}
- if (stat.getCountDistint() > result.getCountDistint()) {
- result.setCountDistint(stat.getCountDistint());
+ // NDV=0 is "unknown" only if the stat is NOT a constant.
+ // Constants with NDV=0 (e.g., NULL) are "known zero", not unknown.
+ if ((result.getCountDistint() == 0 && !result.isConst()) || (stat.getCountDistint() == 0 && !stat.isConst())) {
+ result.setCountDistint(0);
Review Comment:
@zabetak, this is the most complicated problem to solve, in my opinion. The
following code:
https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L1634
is very explicit in assigning an NDV of 0 to NULL constants and 1 to non-NULL
constants. At the same time, an NDV of 0 for a source column typically
indicates that the NDV for the column is "unknown", which could matter a lot
for large tables. Therefore, simply "summing" an NDV of 0 introduces even
bigger mis-estimations.
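To make the ambiguity concrete, here is a minimal sketch of the rule in the
diff above. `Stat` and `combineNdv` are hypothetical stand-ins, not Hive's
`ColStatistics` or `PessimisticStatCombiner`, and summing in the "both known"
case is an assumption, since that branch is not visible in the quoted hunk.

```java
public class NdvCombineSketch {

    /** Hypothetical, simplified stand-in for Hive's ColStatistics. */
    static final class Stat {
        final long ndv;
        final boolean isConst;

        Stat(long ndv, boolean isConst) {
            this.ndv = ndv;
            this.isConst = isConst;
        }

        /** NDV=0 means "unknown" only when the stat is not a constant. */
        boolean isUnknown() {
            return ndv == 0 && !isConst;
        }
    }

    /** Returns the combined NDV, where 0 is used for "unknown". */
    static long combineNdv(Stat a, Stat b) {
        if (a.isUnknown() || b.isUnknown()) {
            return 0; // unknown on either side makes the result unknown
        }
        // ASSUMPTION: both sides are known, so their NDVs are summed.
        // A NULL constant contributes 0 here, which is a "known zero".
        return a.ndv + b.ndv;
    }

    public static void main(String[] args) {
        Stat unknownColumn = new Stat(0, false);
        Stat nullConstant = new Stat(0, true);
        Stat stringConstant = new Stat(1, true);
        Stat knownColumn = new Stat(10, false);

        System.out.println(combineNdv(unknownColumn, stringConstant)); // prints 0
        System.out.println(combineNdv(nullConstant, knownColumn));     // prints 10
        System.out.println(combineNdv(nullConstant, stringConstant));  // prints 1
    }
}
```

Note that two NULL constants combine to 0 as well, and that result is only
distinguishable from "unknown" if the combined stat still carries the constant
flag.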
To compensate for the "0 NDV" NULL constant, the following code:
https://github.com/apache/hive/blob/931d4bb62b26de699240c816df439e00644e3dcb/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java#L2113
has a "+1" adjustment when figuring out NDVs of a GROUP BY.
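As a rough illustration of that kind of adjustment (not a copy of the linked
code), the sketch below assumes the "+1" is applied when a grouping column
contains NULLs, so that NULL counts as its own group. `groupingNdv` and its
parameters are hypothetical names, and the real condition in `StatsUtils` may
differ.

```java
public class GroupByNdvSketch {

    /**
     * ASSUMPTION: a grouping column that contains NULLs yields one extra
     * group, so its effective NDV for GROUP BY is ndv + 1.
     */
    static long groupingNdv(long ndv, long numNulls) {
        return numNulls > 0 ? Math.addExact(ndv, 1) : ndv;
    }

    public static void main(String[] args) {
        // NULL constant: NDV 0, every row is NULL, so it forms a single group.
        System.out.println(groupingNdv(0, 100)); // prints 1
        // Regular column without NULLs: NDV is used as-is.
        System.out.println(groupingNdv(5, 0));   // prints 5
    }
}
```

Under this assumption, a NULL constant ends up with the same grouping NDV as a
non-NULL constant, which is what the adjustment compensates for.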
I am thinking of modifying buildColStatForConstant() to treat NULL values as
regular constants and seeing whether we run into any significant side effects.
If you have any additional thoughts on the subject, I would greatly appreciate
hearing them. Thank you!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]