Konstantin Bereznyakov created HIVE-29625:
---------------------------------------------
Summary: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"
Key: HIVE-29625
URL: https://issues.apache.org/jira/browse/HIVE-29625
Project: Hive
Issue Type: Improvement
Reporter: Konstantin Bereznyakov
h2. Problem
\{{ColStatistics.countDistinct}} (NDV) overloads the value \{{0}}:
* *Verified zero* — the column genuinely has zero non-NULL distinct values
(all-NULL column, empty table).
* *Unknown* — upstream stats did not compute NDV; \{{0}} leaks through as the
Thrift-primitive default from \{{ColumnStatisticsObj.numDVs}}.
Downstream consumers cannot tell the two cases apart, so they apply identical
fallback heuristics (\{{numRows / 2}}, \{{factor *= 0.5}}, \{{MAX_VALUE}}, etc.)
to both. For *verified zero* the heuristic is wrong (the true answer for
\{{col = const}} is 0 matching rows), and for *unknown* it merely papers over
absent information.
The other count-style fields on \{{ColStatistics}} (\{{numNulls}}, \{{numTrues}},
\{{numFalses}}) already follow the convention "negative = unknown, 0 = verified
zero, positive = verified count", established by HIVE-29438. \{{countDistinct}}
never got the same treatment.
h2. Convention after this change
For \{{ColStatistics.countDistinct}}:
* *-1* (or any negative value) means *unknown* — NDV was not gathered or
cannot be determined.
* *0* means *verified zero* — the column has zero non-NULL distinct values.
* *positive value* means *verified count* — exactly that many distinct
non-NULL values.
This matches the existing convention for \{{numNulls}}, \{{numTrues}},
\{{numFalses}}.
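As a rough illustration of how a consumer might branch on the three states, here is a minimal, self-contained sketch. It is not Hive code: the class \{{NdvConventionSketch}}, the enum \{{NdvState}}, and the methods \{{classify}}, \{{fromUpstream}} and \{{estimateEqualityRows}} are hypothetical names introduced only for this example. The \{{numRows / countDistinct}} estimate for a verified count is an assumed uniform-distribution estimate, and the \{{numRows / 2}} fallback is just one of the heuristics listed above.

```java
// Hypothetical sketch of the tri-state NDV convention; not part of Hive.
final class NdvConventionSketch {

    /** Assumed sentinel for "NDV unknown"; any negative value is treated as unknown. */
    public static final long UNKNOWN = -1L;

    public enum NdvState { UNKNOWN, VERIFIED_ZERO, VERIFIED_COUNT }

    /** Maps a raw countDistinct value onto the three states of the convention. */
    public static NdvState classify(long countDistinct) {
        if (countDistinct < 0) {
            return NdvState.UNKNOWN;
        }
        return countDistinct == 0 ? NdvState.VERIFIED_ZERO : NdvState.VERIFIED_COUNT;
    }

    /**
     * Hypothetical boundary conversion: keep the upstream value only when the
     * upstream stats actually computed it, so an unset primitive default of 0
     * is never mistaken for a verified zero.
     */
    public static long fromUpstream(boolean ndvWasSet, long rawNumDVs) {
        return ndvWasSet ? rawNumDVs : UNKNOWN;
    }

    /** Estimated number of rows matching "col = const". */
    public static long estimateEqualityRows(long numRows, long countDistinct) {
        switch (classify(countDistinct)) {
            case VERIFIED_ZERO:
                // No non-NULL values exist, so no row can match.
                return 0L;
            case VERIFIED_COUNT:
                // Assumed uniform distribution across the distinct values.
                return numRows / countDistinct;
            default:
                // Unknown NDV: fall back to a heuristic (numRows / 2 here).
                return numRows / 2;
        }
    }
}
```

With this split, only the unknown case reaches the fallback heuristic; a verified zero yields an estimate of 0 matching rows for \{{col = const}}.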