[ 
https://issues.apache.org/jira/browse/HIVE-29556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Bereznyakov updated HIVE-29556:
------------------------------------------
    Description: 
h2. Problem

  In {{StatsUtils.extractNDVGroupingColumns}}, every grouping column whose
  {{ColStatistics}} arrives with the signature {{(NDV == 0 && numNulls > 0)}} 
is unconditionally promoted to {{NDV = 1}} via:

  {code:java}
  if (cs.getNumNulls() > 0) {
    ndv = StatsUtils.safeAdd(ndv, 1);
  }
  {code}

  The {{+1}} adjustment exists to account for the NULL bucket — GROUP BY treats 
NULL as a
  distinct group key in SQL semantics. The intent is correct when {{NDV}} 
carries a real
  non-zero count: {{K + 1}} properly represents "K real groups + 1 NULL bucket".

  The bug is that {{NDV = 0}} is *overloaded* in Hive's stats model. It can 
mean either:

  # *Unknown NDV* — stats are partial or NDV was not computed by the producer; 
{{0}} is the
  sentinel for "we don't know".
  # *Verified zero* — the column has zero non-null distinct values (e.g., 
all-NULL).

  The {{+1}} adjustment yields the correct answer for case (2) — {{1}} is 
exactly the count
  of distinct groups for an all-NULL column (the NULL bucket). For case (1) it 
produces a
  *confidently-wrong NDV count*: it converts "we don't know" into "we know 
there's exactly
  1 group", masking the unknown nature of the stats.

  h2. How the under-estimate propagates

  {{computeNDVGroupingColumns}} reduces per-column NDVs into {{ndvProduct}}. 
With the bogus
  {{NDV = 1}}, {{ndvProduct >= 1}}. Downstream consumers gate on {{ndvProduct 
== 0}}:

  * {{StatsRulesProcFactory}} (GROUP BY stats annotation) falls back to a
  {{parentNumRows / 2}} heuristic when {{ndvProduct == 0}}; otherwise it uses 
{{ndvProduct}}
  directly. With the bug, the fallback never fires.
  * {{SetHashGroupByMinReduction}} returns {{null}} (keeps the operator's 
default
  {{minReductionHashAggr}}) when {{ndvProduct == 0}}; otherwise it computes
  {{1 - ndvProduct/numRows}}.

  The Case-3 cardinality formula in {{StatsRulesProcFactory}} then becomes:

  {code}
  cardinality = min(parentNumRows/2, ndvProduct * parallelism)
              = min(50M, 1)
              = 1
  {code}

  The GROUP BY operator's row-count annotation in EXPLAIN reads {{Statistics: 
Num rows: 1}}.
  h2. Cascading consequences

  The bogus 1-row estimate propagates downstream and causes CBO to make wrong
  operator-selection decisions:

  * *Map Join selection* — a 1-row side is always broadcast-eligible. The 
optimizer chooses
  {{Map Join Operator}} where the actual data would overflow the broadcast 
threshold (and
  should be a {{Merge Join Operator}}). This is the most visible failure mode 
at scale and
  risks runtime mapjoin OOM.
  * *Reducer parallelism* — too few reducers allocated for what is actually a 
large
  GROUP BY output.
  * *Memory sizing* — undersized hash-aggregation buffers.
  * *Join order* — CBO favors plan shapes built on the "small" intermediate, 
even if the
  actual data isn't small.

  h2. Sources of the {{(NDV=0, numNulls>0)}} signature on master

  The signature can reach {{extractNDVGroupingColumns}} through multiple paths 
in the
  current code:

  # *Metastore-loaded stats* with NDV not gathered but numNulls populated 
(partial ANALYZE;
  explicit {{ALTER TABLE UPDATE STATISTICS}}; upstream stats producer that 
doesn't compute
  NDV).
  # *PessimisticStatCombiner output* for conditional expressions ({{CASE WHEN}},
  {{COALESCE}}, {{IF}}) where one branch contributes unknown NDV (NDV=0) and 
another is a
  NULL literal: the combiner's {{max(NDV)}} over {{[0, 0]}} stays at {{0}}, 
while
  {{max(numNulls)}} picks up the NULL literal's {{numNulls = parentNumRows}}, 
producing the
  signature on the combined output column.

  (The {{NDV=0}} signature can *also* arise from genuinely verified-zero 
sources —
  {{buildColStatForConstant}} for NULL literals and {{buildColStatForBoolean}} 
for
  all-NULL booleans — but those cases are not part of this issue: the {{+1}} 
adjustment
  yields the semantically-correct answer there.)

  h2. Reproduction

  Two probes attached: {{groupby_unknown_ndv.q}} (the queries) and
  {{groupby_unknown_ndv.q.out}} (master's output, captured against the current 
master
  branch with {{TestMiniLlapLocalCliDriver}}).

  * *U1* — direct GROUP BY on a column with stats explicitly set to {{(numDVs=0,
  numNulls=parentRows)}} (source #1).
  * *U2* — GROUP BY a {{CASE WHEN}} expression with one NULL-literal branch 
combined with
  an unknown-NDV column (source #2).

  For both probes, master estimates {{Statistics: Num rows: 1}} on the inner 
GROUP BY
  operators. With {{hive.auto.convert.join=true}}, the bogus 1-row side is 
broadcast-eligible
  and the optimizer selects {{Map Join Operator}} for the downstream join, even 
though the
  join is sized for {{Merge Join Operator}}. The {{.q.out}} shows two {{Map 
Join Operator}}
  instances — exactly the operator-level under-estimation HIVE-29556 targets.

  was:The side effect of this is a potentially severe under-estimation of the 
number of GROUP BY result rows


> StatsUtils::extractNDVGroupingColumns() adjust "unknown" NDVs of "0" to "1"
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-29556
>                 URL: https://issues.apache.org/jira/browse/HIVE-29556
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Konstantin Bereznyakov
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: reproduce_groupby_unknown_ndv.q, 
> reproduce_groupby_unknown_ndv.q.out
>
>
> h2. Problem
>   In {{StatsUtils.extractNDVGroupingColumns}}, every grouping column whose
>   {{ColStatistics}} arrives with the signature {{(NDV == 0 && numNulls > 0)}} 
> is unconditionally promoted to {{NDV = 1}} via:
>   {code:java}
>   if (cs.getNumNulls() > 0) {
>     ndv = StatsUtils.safeAdd(ndv, 1);
>   }
>   {code}
>   The {{+1}} adjustment exists to account for the NULL bucket — GROUP BY 
> treats NULL as a
>   distinct group key in SQL semantics. The intent is correct when {{NDV}} 
> carries a real
>   non-zero count: {{K + 1}} properly represents "K real groups + 1 NULL 
> bucket".
>   The bug is that {{NDV = 0}} is *overloaded* in Hive's stats model. It can 
> mean either:
>   # *Unknown NDV* — stats are partial or NDV was not computed by the 
> producer; {{0}} is the
>   sentinel for "we don't know".
>   # *Verified zero* — the column has zero non-null distinct values (e.g., 
> all-NULL).
>   The {{+1}} adjustment yields the correct answer for case (2) — {{1}} is 
> exactly the count
>   of distinct groups for an all-NULL column (the NULL bucket). For case (1) 
> it produces a
>   *confidently-wrong NDV count*: it converts "we don't know" into "we know 
> there's exactly
>   1 group", masking the unknown nature of the stats.
>   h2. How the under-estimate propagates
>   {{computeNDVGroupingColumns}} reduces per-column NDVs into {{ndvProduct}}. 
> With the bogus
>   {{NDV = 1}}, {{ndvProduct >= 1}}. Downstream consumers gate on {{ndvProduct 
> == 0}}:
>   * {{StatsRulesProcFactory}} (GROUP BY stats annotation) falls back to a
>   {{parentNumRows / 2}} heuristic when {{ndvProduct == 0}}; otherwise it uses 
> {{ndvProduct}}
>   directly. With the bug, the fallback never fires.
>   * {{SetHashGroupByMinReduction}} returns {{null}} (keeps the operator's 
> default
>   {{minReductionHashAggr}}) when {{ndvProduct == 0}}; otherwise it computes
>   {{1 - ndvProduct/numRows}}.
>   The Case-3 cardinality formula in {{StatsRulesProcFactory}} then becomes:
>   {code}
>   cardinality = min(parentNumRows/2, ndvProduct * parallelism)
>               = min(50M, 1)
>               = 1
>   {code}
>   The GROUP BY operator's row-count annotation in EXPLAIN reads {{Statistics: 
> Num rows: 1}}.
>   h2. Cascading consequences
>   The bogus 1-row estimate propagates downstream and causes CBO to make wrong
>   operator-selection decisions:
>   * *Map Join selection* — a 1-row side is always broadcast-eligible. The 
> optimizer chooses
>   {{Map Join Operator}} where the actual data would overflow the broadcast 
> threshold (and
>   should be a {{Merge Join Operator}}). This is the most visible failure mode 
> at scale and
>   risks runtime mapjoin OOM.
>   * *Reducer parallelism* — too few reducers allocated for what is actually a 
> large
>   GROUP BY output.
>   * *Memory sizing* — undersized hash-aggregation buffers.
>   * *Join order* — CBO favors plan shapes built on the "small" intermediate, 
> even if the
>   actual data isn't small.
>   h2. Sources of the {{(NDV=0, numNulls>0)}} signature on master
>   The signature can reach {{extractNDVGroupingColumns}} through multiple 
> paths in the
>   current code:
>   # *Metastore-loaded stats* with NDV not gathered but numNulls populated 
> (partial ANALYZE;
>   explicit {{ALTER TABLE UPDATE STATISTICS}}; upstream stats producer that 
> doesn't compute
>   NDV).
>   # *PessimisticStatCombiner output* for conditional expressions ({{CASE 
> WHEN}},
>   {{COALESCE}}, {{IF}}) where one branch contributes unknown NDV (NDV=0) and 
> another is a
>   NULL literal: the combiner's {{max(NDV)}} over {{[0, 0]}} stays at {{0}}, 
> while
>   {{max(numNulls)}} picks up the NULL literal's {{numNulls = parentNumRows}}, 
> producing the
>   signature on the combined output column.
>   (The {{NDV=0}} signature can *also* arise from genuinely verified-zero 
> sources —
>   {{buildColStatForConstant}} for NULL literals and 
> {{buildColStatForBoolean}} for
>   all-NULL booleans — but those cases are not part of this issue: the {{+1}} 
> adjustment
>   yields the semantically-correct answer there.)
>   h2. Reproduction
>   Two probes attached: {{groupby_unknown_ndv.q}} (the queries) and
>   {{groupby_unknown_ndv.q.out}} (master's output, captured against the 
> current master
>   branch with {{TestMiniLlapLocalCliDriver}}).
>   * *U1* — direct GROUP BY on a column with stats explicitly set to 
> {{(numDVs=0,
>   numNulls=parentRows)}} (source #1).
>   * *U2* — GROUP BY a {{CASE WHEN}} expression with one NULL-literal branch 
> combined with
>   an unknown-NDV column (source #2).
>   For both probes, master estimates {{Statistics: Num rows: 1}} on the inner 
> GROUP BY
>   operators. With {{hive.auto.convert.join=true}}, the bogus 1-row side is 
> broadcast-eligible
>   and the optimizer selects {{Map Join Operator}} for the downstream join, 
> even though the
>   join is sized for {{Merge Join Operator}}. The {{.q.out}} shows two {{Map 
> Join Operator}}
>   instances — exactly the operator-level under-estimation HIVE-29556 targets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to