[ 
https://issues.apache.org/jira/browse/HIVE-29625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29625:
------------------------------------------
    Description: 
h2. Problem

  {{ColStatistics.countDistinct}} (NDV) overloads the value {{0}}:

  * *Verified zero* — the column genuinely has zero non-NULL distinct values
    (all-NULL column, empty table).
  * *Unknown* — upstream stats did not compute NDV; {{0}} leaks through as the
    Thrift-primitive default from {{ColumnStatisticsObj.numDVs}}.
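  A minimal Java sketch of how the default can leak. {{StatsObj}}, its {{numDVsSet}} flag, and {{toCountDistinct}} are hypothetical stand-ins for illustration, not the actual Thrift-generated or Hive classes:

```java
class ThriftDefaultLeak {
    // Hypothetical stand-in for a generated stats struct with a primitive field.
    static class StatsObj {
        long numDVs;        // a primitive long reads as 0 until it is assigned
        boolean numDVsSet;  // hypothetical presence flag

        void setNumDVs(long value) {
            numDVs = value;
            numDVsSet = true;
        }
    }

    // Maps an unassigned field to -1 instead of letting the default 0 through.
    static long toCountDistinct(StatsObj stats) {
        return stats.numDVsSet ? stats.numDVs : -1;
    }

    public static void main(String[] args) {
        StatsObj neverComputed = new StatsObj(); // NDV was never gathered
        StatsObj allNullColumn = new StatsObj();
        allNullColumn.setNumDVs(0);              // NDV gathered, genuinely zero

        // Reading the raw field gives 0 in both cases.
        System.out.println(neverComputed.numDVs + " " + allNullColumn.numDVs);
        // Consulting the presence flag keeps the two cases apart.
        System.out.println(toCountDistinct(neverComputed) + " " + toCountDistinct(allNullColumn));
    }
}
```

  Reading the raw field cannot separate the two objects; only the presence check can.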

  Downstream consumers cannot tell the two cases apart, so they apply identical
  fallback heuristics ({{numRows / 2}}, {{factor *= 0.5}}, {{MAX_VALUE}}, etc.)
  to both. For *verified zero* the heuristic is wrong (the true answer for
  {{col = const}} is 0 matching rows), and for *unknown* it merely papers over
  absent information.
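
  The conflation can be illustrated with a small Java sketch. {{NdvAmbiguity}} and its methods are hypothetical and do not reproduce any Hive estimator; the sketch only assumes a consumer that treats {{0}} as "no NDV" and applies the {{numRows / 2}} fallback mentioned above:

```java
class NdvAmbiguity {
    // Hypothetical consumer: treats 0 as "no NDV available" and falls back
    // to numRows / 2 (at least 1, to avoid dividing by zero later).
    static long effectiveNdv(long countDistinct, long numRows) {
        if (countDistinct == 0) {
            return Math.max(1, numRows / 2);
        }
        return countDistinct;
    }

    // Estimated rows matching col = const, modelled as numRows / NDV.
    static long estimateEqualityRows(long countDistinct, long numRows) {
        return numRows / effectiveNdv(countDistinct, numRows);
    }

    public static void main(String[] args) {
        long numRows = 1000;
        long allNullColumn = 0;      // verified zero
        long statsNeverGathered = 0; // unknown, leaked primitive default

        // Both calls take the same fallback path and return the same estimate.
        System.out.println(estimateEqualityRows(allNullColumn, numRows));
        System.out.println(estimateEqualityRows(statsNeverGathered, numRows));
    }
}
```

  With 1000 rows both calls estimate 2 matching rows, although for the all-NULL column the correct estimate is 0.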

  The other count-style fields on {{ColStatistics}} ({{numNulls}}, {{numTrues}},
  {{numFalses}}) already follow the convention "negative = unknown, 0 = verified
  zero, positive = verified count", established by HIVE-29438. {{countDistinct}}
  never got the same treatment.

  h2. Convention after this change

  For {{ColStatistics.countDistinct}}:

  * *-1* (or any negative value) means *unknown* — NDV was not gathered or
    cannot be determined.
  * *0* means *verified zero* — the column has zero non-NULL distinct values.
  * *positive value* means *verified count* — exactly that many distinct
    non-NULL values.

  This matches the existing convention for {{numNulls}}, {{numTrues}},
  {{numFalses}}.
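
  A minimal Java sketch of how a consumer might branch on the three states. {{NdvConvention}}, {{NdvState}}, and both methods are hypothetical names, and keeping the {{numRows / 2}} fallback for the unknown case is an assumption, not something this issue prescribes:

```java
class NdvConvention {
    enum NdvState { UNKNOWN, VERIFIED_ZERO, VERIFIED_COUNT }

    // negative = unknown, 0 = verified zero, positive = verified count
    static NdvState classify(long countDistinct) {
        if (countDistinct < 0) {
            return NdvState.UNKNOWN;
        }
        if (countDistinct == 0) {
            return NdvState.VERIFIED_ZERO;
        }
        return NdvState.VERIFIED_COUNT;
    }

    // Estimated rows matching col = const under the three-state convention.
    static long estimateEqualityRows(long countDistinct, long numRows) {
        switch (classify(countDistinct)) {
            case VERIFIED_ZERO:
                return 0; // no non-NULL values exist, so nothing can match
            case VERIFIED_COUNT:
                return numRows / countDistinct;
            default:
                // Unknown: assumed fallback NDV of numRows / 2 (at least 1).
                return numRows / Math.max(1, numRows / 2);
        }
    }

    public static void main(String[] args) {
        long numRows = 1000;
        System.out.println(estimateEqualityRows(-1, numRows)); // unknown
        System.out.println(estimateEqualityRows(0, numRows));  // verified zero
        System.out.println(estimateEqualityRows(50, numRows)); // verified count
    }
}
```

  Only the unknown branch needs a heuristic; the two verified branches use the statistic directly.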


> Disambiguate ColStatistics.countDistinct "unknown" from "verified zero"
> -----------------------------------------------------------------------
>
>                 Key: HIVE-29625
>                 URL: https://issues.apache.org/jira/browse/HIVE-29625
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Konstantin Bereznyakov
>            Priority: Major
>


