Konstantin Bereznyakov created HIVE-29625:
---------------------------------------------

             Summary: Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"
                 Key: HIVE-29625
                 URL: https://issues.apache.org/jira/browse/HIVE-29625
             Project: Hive
          Issue Type: Improvement
            Reporter: Konstantin Bereznyakov


h2. Problem

  \{{ColStatistics.countDistinct}} (NDV) overloads the value \{{0}}:

  * *Verified zero* — the column genuinely has zero non-NULL distinct values
    (all-NULL column, empty table).
  * *Unknown* — upstream stats did not compute NDV; \{{0}} leaks through as the
    Thrift-primitive default from \{{ColumnStatisticsObj.numDVs}}.

  Downstream consumers cannot tell the two cases apart, so they apply identical
  fallback heuristics (\{{numRows / 2}}, \{{factor *= 0.5}}, \{{MAX_VALUE}}, etc.)
  to both. For *verified zero* the heuristic is wrong (the true answer for
  \{{col = const}} is 0 matching rows), and for *unknown* it merely papers over
  absent information.
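
  A minimal sketch of the ambiguity, not Hive's actual code: the class
  \{{NdvAmbiguityDemo}} and the method \{{estimateEqualityRows}} are hypothetical
  names, and the \{{numRows / countDistinct}} estimate for a known NDV is an
  assumption made for illustration. Only the \{{numRows / 2}} fallback comes from
  the description above.

```java
// Hypothetical illustration only; these names do not exist in Hive.
final class NdvAmbiguityDemo {

    /** Estimated rows matching `col = const` when 0 is the only "no NDV" signal. */
    static long estimateEqualityRows(long numRows, long countDistinct) {
        if (countDistinct == 0) {
            // Cannot tell "all-NULL column" from "NDV never computed",
            // so both cases fall back to the same heuristic.
            return numRows / 2;
        }
        // Assumed uniform-distribution estimate when NDV is known.
        return numRows / countDistinct;
    }

    public static void main(String[] args) {
        long numRows = 1000;
        long allNullColumnNdv = 0; // verified zero: the true answer is 0 rows
        long statsMissingNdv = 0;  // unknown: the primitive default leaked through

        System.out.println(estimateEqualityRows(numRows, allNullColumnNdv)); // prints 500
        System.out.println(estimateEqualityRows(numRows, statsMissingNdv));  // prints 500
    }
}
```

  Both calls return the same estimate, although only the second one has any
  reason to guess.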

  The other count-style fields on \{{ColStatistics}} (\{{numNulls}},
  \{{numTrues}}, \{{numFalses}}) already follow the convention "negative =
  unknown, 0 = verified zero, positive = verified count", established by
  HIVE-29438. \{{countDistinct}} never got the same treatment.

h2. Convention after this change

  For \{{ColStatistics.countDistinct}}:

  * *-1* (or any negative value) means *unknown* — NDV was not gathered or
    cannot be determined.
  * *0* means *verified zero* — the column has zero non-NULL distinct values.
  * *positive value* means *verified count* — exactly that many distinct
    non-NULL values.

  This matches the existing convention for \{{numNulls}}, \{{numTrues}}, and
  \{{numFalses}}.
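
  One way a consumer might apply the three-way convention, sketched under
  assumptions: \{{NdvConvention}}, \{{NdvState}}, \{{classify}} and
  \{{estimateEqualityRows}} are hypothetical names rather than Hive APIs, and the
  estimates used for the unknown and verified-count cases are illustrative
  choices. Only the mapping of negative, zero and positive values comes from the
  convention above.

```java
// Hypothetical illustration only; these names do not exist in Hive.
final class NdvConvention {

    enum NdvState { UNKNOWN, VERIFIED_ZERO, VERIFIED_COUNT }

    /** Maps a raw countDistinct value onto the three states of the convention. */
    static NdvState classify(long countDistinct) {
        if (countDistinct < 0) {
            return NdvState.UNKNOWN;        // -1 or any negative value
        }
        if (countDistinct == 0) {
            return NdvState.VERIFIED_ZERO;  // no non-NULL distinct values
        }
        return NdvState.VERIFIED_COUNT;     // exactly this many distinct values
    }

    /** Estimated rows matching `col = const`, treating each state separately. */
    static long estimateEqualityRows(long numRows, long countDistinct) {
        switch (classify(countDistinct)) {
            case VERIFIED_ZERO:
                return 0;                   // nothing can match a constant
            case UNKNOWN:
                return numRows / 2;         // assumed fallback heuristic
            default:
                return numRows / countDistinct; // assumed uniform estimate
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(-1));                    // prints UNKNOWN
        System.out.println(estimateEqualityRows(1000, 0));   // prints 0
        System.out.println(estimateEqualityRows(1000, -1));  // prints 500
        System.out.println(estimateEqualityRows(1000, 10));  // prints 100
    }
}
```

  With the states separated, the fallback heuristic applies only when NDV is
  genuinely unknown, and a verified zero yields 0 matching rows.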



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
