[
https://issues.apache.org/jira/browse/HIVE-29556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Konstantin Bereznyakov updated HIVE-29556:
------------------------------------------
Description:
h2. Problem
In {{StatsUtils.extractNDVGroupingColumns}}, every grouping column whose
{{ColStatistics}} arrives with the signature {{(NDV == 0 && numNulls > 0)}}
is unconditionally promoted to {{NDV = 1}} via:
{code:java}
if (cs.getNumNulls() > 0) {
ndv = StatsUtils.safeAdd(ndv, 1);
}
{code}
The {{+1}} adjustment exists to account for the NULL bucket — GROUP BY treats
NULL as a
distinct group key in SQL semantics. The intent is correct when {{NDV}}
carries a real
non-zero count: {{K + 1}} properly represents "K real groups + 1 NULL bucket".
The bug is that {{NDV = 0}} is *overloaded* in Hive's stats model. It can
mean either:
# *Unknown NDV* — stats are partial or NDV was not computed by the producer;
{{0}} is the
sentinel for "we don't know".
# *Verified zero* — the column has zero non-null distinct values (e.g.,
all-NULL).
The {{+1}} adjustment yields the correct answer for case (2) — {{1}} is
exactly the count
of distinct groups for an all-NULL column (the NULL bucket). For case (1) it
produces a
*confidently-wrong NDV count*: it converts "we don't know" into "we know
there's exactly
1 group", masking the unknown nature of the stats.
h2. How the under-estimate propagates
{{computeNDVGroupingColumns}} reduces per-column NDVs into {{ndvProduct}}.
With the bogus
{{NDV = 1}}, {{ndvProduct >= 1}}. Downstream consumers gate on {{ndvProduct
== 0}}:
* {{StatsRulesProcFactory}} (GROUP BY stats annotation) falls back to a
{{parentNumRows / 2}} heuristic when {{ndvProduct == 0}}; otherwise it uses
{{ndvProduct}}
directly. With the bug, the fallback never fires.
* {{SetHashGroupByMinReduction}} returns {{null}} (keeps the operator's
default
{{minReductionHashAggr}}) when {{ndvProduct == 0}}; otherwise it computes
{{1 - ndvProduct/numRows}}.
The Case-3 cardinality formula in {{StatsRulesProcFactory}} then becomes:
{code}
cardinality = min(parentNumRows/2, ndvProduct * parallelism)
= min(50M, 1)
= 1
{code}
The GROUP BY operator's row-count annotation in EXPLAIN reads {{Statistics:
Num rows: 1}}.
h2. Cascading consequences
The bogus 1-row estimate propagates downstream and causes CBO to make wrong
operator-selection decisions:
* *Map Join selection* — a 1-row side is always broadcast-eligible. The
optimizer chooses
{{Map Join Operator}} where the actual data would overflow the broadcast
threshold (and
should be a {{Merge Join Operator}}). This is the most visible failure mode
at scale and
risks runtime mapjoin OOM.
* *Reducer parallelism* — too few reducers allocated for what is actually a
large
GROUP BY output.
* *Memory sizing* — undersized hash-aggregation buffers.
* *Join order* — CBO favors plan shapes built on the "small" intermediate,
even if the
actual data isn't small.
h2. Sources of the {{(NDV=0, numNulls>0)}} signature on master
The signature can reach {{extractNDVGroupingColumns}} through multiple paths
in the
current code:
# *Metastore-loaded stats* with NDV not gathered but numNulls populated
(partial ANALYZE;
explicit {{ALTER TABLE UPDATE STATISTICS}}; upstream stats producer that
doesn't compute
NDV).
# *PessimisticStatCombiner output* for conditional expressions ({{CASE WHEN}},
{{COALESCE}}, {{IF}}) where one branch contributes unknown NDV (NDV=0) and
another is a
NULL literal: the combiner's {{max(NDV)}} over {{[0, 0]}} stays at {{0}},
while
{{max(numNulls)}} picks up the NULL literal's {{numNulls = parentNumRows}},
producing the
signature on the combined output column.
(The {{NDV=0}} signature can *also* arise from genuinely verified-zero
sources —
{{buildColStatForConstant}} for NULL literals and {{buildColStatForBoolean}}
for
all-NULL booleans — but those cases are not part of this issue: the {{+1}}
adjustment
yields the semantically-correct answer there.)
h2. Reproduction
Two probes attached: {{groupby_unknown_ndv.q}} (the queries) and
{{groupby_unknown_ndv.q.out}} (master's output, captured against the current
master
branch with {{TestMiniLlapLocalCliDriver}}).
* *U1* — direct GROUP BY on a column with stats explicitly set to {{(numDVs=0,
numNulls=parentRows)}} (source #1).
* *U2* — GROUP BY a {{CASE WHEN}} expression with one NULL-literal branch
combined with
an unknown-NDV column (source #2).
For both probes, master estimates {{Statistics: Num rows: 1}} on the inner
GROUP BY
operators. With {{hive.auto.convert.join=true}}, the bogus 1-row side is
broadcast-eligible
and the optimizer selects {{Map Join Operator}} for the downstream join, even
though the
join is sized for {{Merge Join Operator}}. The {{.q.out}} shows two {{Map
Join Operator}}
instances — exactly the operator-level under-estimation HIVE-29556 targets.
was:The side effect of this is a potentially severe under-estimation of the
number of GROUP BY result rows
> StatsUtils::extractNDVGroupingColumns() adjust "unknown" NDVs of "0" to "1"
> ---------------------------------------------------------------------------
>
> Key: HIVE-29556
> URL: https://issues.apache.org/jira/browse/HIVE-29556
> Project: Hive
> Issue Type: Bug
> Reporter: Konstantin Bereznyakov
> Priority: Major
> Labels: pull-request-available
> Attachments: reproduce_groupby_unknown_ndv.q,
> reproduce_groupby_unknown_ndv.q.out
>
>
> h2. Problem
> In {{StatsUtils.extractNDVGroupingColumns}}, every grouping column whose
> {{ColStatistics}} arrives with the signature {{(NDV == 0 && numNulls > 0)}}
> is unconditionally promoted to {{NDV = 1}} via:
> {code:java}
> if (cs.getNumNulls() > 0) {
> ndv = StatsUtils.safeAdd(ndv, 1);
> }
> {code}
> The {{+1}} adjustment exists to account for the NULL bucket — GROUP BY
> treats NULL as a
> distinct group key in SQL semantics. The intent is correct when {{NDV}}
> carries a real
> non-zero count: {{K + 1}} properly represents "K real groups + 1 NULL
> bucket".
> The bug is that {{NDV = 0}} is *overloaded* in Hive's stats model. It can
> mean either:
> # *Unknown NDV* — stats are partial or NDV was not computed by the
> producer; {{0}} is the
> sentinel for "we don't know".
> # *Verified zero* — the column has zero non-null distinct values (e.g.,
> all-NULL).
> The {{+1}} adjustment yields the correct answer for case (2) — {{1}} is
> exactly the count
> of distinct groups for an all-NULL column (the NULL bucket). For case (1)
> it produces a
> *confidently-wrong NDV count*: it converts "we don't know" into "we know
> there's exactly
> 1 group", masking the unknown nature of the stats.
> h2. How the under-estimate propagates
> {{computeNDVGroupingColumns}} reduces per-column NDVs into {{ndvProduct}}.
> With the bogus
> {{NDV = 1}}, {{ndvProduct >= 1}}. Downstream consumers gate on {{ndvProduct
> == 0}}:
> * {{StatsRulesProcFactory}} (GROUP BY stats annotation) falls back to a
> {{parentNumRows / 2}} heuristic when {{ndvProduct == 0}}; otherwise it uses
> {{ndvProduct}}
> directly. With the bug, the fallback never fires.
> * {{SetHashGroupByMinReduction}} returns {{null}} (keeps the operator's
> default
> {{minReductionHashAggr}}) when {{ndvProduct == 0}}; otherwise it computes
> {{1 - ndvProduct/numRows}}.
> The Case-3 cardinality formula in {{StatsRulesProcFactory}} then becomes:
> {code}
> cardinality = min(parentNumRows/2, ndvProduct * parallelism)
> = min(50M, 1)
> = 1
> {code}
> The GROUP BY operator's row-count annotation in EXPLAIN reads {{Statistics:
> Num rows: 1}}.
> h2. Cascading consequences
> The bogus 1-row estimate propagates downstream and causes CBO to make wrong
> operator-selection decisions:
> * *Map Join selection* — a 1-row side is always broadcast-eligible. The
> optimizer chooses
> {{Map Join Operator}} where the actual data would overflow the broadcast
> threshold (and
> should be a {{Merge Join Operator}}). This is the most visible failure mode
> at scale and
> risks runtime mapjoin OOM.
> * *Reducer parallelism* — too few reducers allocated for what is actually a
> large
> GROUP BY output.
> * *Memory sizing* — undersized hash-aggregation buffers.
> * *Join order* — CBO favors plan shapes built on the "small" intermediate,
> even if the
> actual data isn't small.
> h2. Sources of the {{(NDV=0, numNulls>0)}} signature on master
> The signature can reach {{extractNDVGroupingColumns}} through multiple
> paths in the
> current code:
> # *Metastore-loaded stats* with NDV not gathered but numNulls populated
> (partial ANALYZE;
> explicit {{ALTER TABLE UPDATE STATISTICS}}; upstream stats producer that
> doesn't compute
> NDV).
> # *PessimisticStatCombiner output* for conditional expressions ({{CASE
> WHEN}},
> {{COALESCE}}, {{IF}}) where one branch contributes unknown NDV (NDV=0) and
> another is a
> NULL literal: the combiner's {{max(NDV)}} over {{[0, 0]}} stays at {{0}},
> while
> {{max(numNulls)}} picks up the NULL literal's {{numNulls = parentNumRows}},
> producing the
> signature on the combined output column.
> (The {{NDV=0}} signature can *also* arise from genuinely verified-zero
> sources —
> {{buildColStatForConstant}} for NULL literals and
> {{buildColStatForBoolean}} for
> all-NULL booleans — but those cases are not part of this issue: the {{+1}}
> adjustment
> yields the semantically-correct answer there.)
> h2. Reproduction
> Two probes attached: {{groupby_unknown_ndv.q}} (the queries) and
> {{groupby_unknown_ndv.q.out}} (master's output, captured against the
> current master
> branch with {{TestMiniLlapLocalCliDriver}}).
> * *U1* — direct GROUP BY on a column with stats explicitly set to
> {{(numDVs=0,
> numNulls=parentRows)}} (source #1).
> * *U2* — GROUP BY a {{CASE WHEN}} expression with one NULL-literal branch
> combined with
> an unknown-NDV column (source #2).
> For both probes, master estimates {{Statistics: Num rows: 1}} on the inner
> GROUP BY
> operators. With {{hive.auto.convert.join=true}}, the bogus 1-row side is
> broadcast-eligible
> and the optimizer selects {{Map Join Operator}} for the downstream join,
> even though the
> join is sized for {{Merge Join Operator}}. The {{.q.out}} shows two {{Map
> Join Operator}}
> instances — exactly the operator-level under-estimation HIVE-29556 targets.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)