[ 
https://issues.apache.org/jira/browse/HIVE-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-29625:
----------------------------------
    Labels: pull-request-available  (was: )

> Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"
> -----------------------------------------------------------------------
>
>                 Key: HIVE-29625
>                 URL: https://issues.apache.org/jira/browse/HIVE-29625
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Problem
>   {{ColStatistics.countDistinct}} (NDV) overloads the value {{0}}:
>   * *Verified zero* — the column genuinely has zero non-NULL distinct values
>     (all-NULL column, empty table).
>   * *Unknown* — upstream stats did not compute NDV; {{0}} leaks through as the
>     Thrift-primitive default from {{ColumnStatisticsObj.numDVs}}.
>   Downstream consumers cannot tell the two cases apart, so they apply identical
>   fallback heuristics ({{numRows / 2}}, {{factor *= 0.5}}, {{MAX_VALUE}}, etc.) to
>   both. For *verified zero* the heuristic is wrong (the true answer for
>   {{col = const}} is 0 matching rows), and for *unknown* it merely papers over
>   absent information.
>   The other count-style fields on {{ColStatistics}} ({{numNulls}}, {{numTrues}},
>   {{numFalses}}) already follow the convention "negative = unknown, 0 = verified
>   zero, positive = verified count", established by HIVE-29438. {{countDistinct}}
>   never got the same treatment.
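
The ambiguity described in the Problem section can be illustrated with a minimal Java sketch. The class and method names are hypothetical and are not Hive's actual estimator code; only the {{numRows / 2}} fallback is taken from the description, and the {{numRows / countDistinct}} estimate for a known NDV is an assumed uniform-distribution simplification.

```java
// Hypothetical illustration of the overloaded-zero problem; not Hive code.
final class OverloadedNdvSketch {

    // Estimated rows matching `col = const` when 0 may mean either
    // "unknown" or "verified zero" and the consumer cannot tell which.
    static long estimateEqualsRows(long numRows, long countDistinct) {
        if (countDistinct == 0) {
            // The two cases are indistinguishable, so one heuristic serves both.
            return numRows / 2;
        }
        // Assumed uniform-distribution estimate for a known NDV.
        return numRows / countDistinct;
    }

    public static void main(String[] args) {
        long numRows = 1000;
        // All-NULL column: NDV is a verified zero, the true answer is 0 rows.
        long allNullColumn = estimateEqualsRows(numRows, 0);
        // Stats never gathered: the primitive default 0 leaked through.
        long statsMissing = estimateEqualsRows(numRows, 0);
        // Both calls receive the same input, so both get the same estimate.
        System.out.println(allNullColumn + " " + statsMissing);
    }
}
```

Both calls return the same heuristic estimate, although only the second one actually lacks information.
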
>   h2. Convention after this change
>   For {{ColStatistics.countDistinct}}:
>   * *-1* (or any negative value) means *unknown* — NDV was not gathered or
>     cannot be determined.
>   * *0* means *verified zero* — the column has zero non-NULL distinct values.
>   * *positive value* means *verified count* — exactly that many distinct
>     non-NULL values.
>   This matches the existing convention for {{numNulls}}, {{numTrues}},
>   {{numFalses}}.
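
As a rough sketch of how a consumer might read the convention above: the enum, class, and method names below are hypothetical and do not come from the Hive code base. The {{numRows / 2}} fallback for the unknown case is one of the heuristics named in the description, and the {{numRows / countDistinct}} estimate is an assumed uniform-distribution simplification.

```java
// Hypothetical helper showing the tri-state reading of countDistinct;
// names are illustrative and not part of Hive.
final class NdvConventionSketch {

    enum NdvState { UNKNOWN, VERIFIED_ZERO, VERIFIED_COUNT }

    // negative = unknown, 0 = verified zero, positive = verified count
    static NdvState classify(long countDistinct) {
        if (countDistinct < 0) {
            return NdvState.UNKNOWN;
        }
        if (countDistinct == 0) {
            return NdvState.VERIFIED_ZERO;
        }
        return NdvState.VERIFIED_COUNT;
    }

    // Estimated rows matching `col = const` under the new convention.
    static long estimateEqualsRows(long numRows, long countDistinct) {
        switch (classify(countDistinct)) {
            case VERIFIED_ZERO:
                // No non-NULL values exist, so no row can match.
                return 0;
            case VERIFIED_COUNT:
                // Assumed uniform-distribution estimate.
                return numRows / countDistinct;
            default:
                // Unknown NDV: only here is a fallback heuristic justified.
                return numRows / 2;
        }
    }
}
```

With this split, the heuristic applies only when NDV is genuinely unknown, and a verified zero yields an estimate of 0 matching rows.
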



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
