konstantinb commented on PR #6418:
URL: https://github.com/apache/hive/pull/6418#issuecomment-4361803611
@zabetak — picking up on our earlier discussion in #6359, I'd like to walk
through what this revision actually does and how it lines up with the concerns
you raised there. The PR's primary subject
(the `extractNDVGroupingColumns` over-confident `+1` for unknown NDV) is
fixed in a way that resolves the const-NULL edge case you flagged, without an
architectural refactor.
What was done
A single in-memory boolean field `isConst` is added to `ColStatistics`,
set at exactly the sites that produce a *verifiably*-zero NDV:
1. `StatsUtils.buildColStatForConstant` — every constant projection
(NULL or non-NULL literal).
2. `StatsUtils.getColStatistics` BOOLEAN branch, all-NULL sub-case
(`numTrues == 0 && numFalses == 0`).
3. `StatsUtils.getColStatistics` BOOLEAN branch, verified single-value
sub-case (numNulls == 0 with one of {numTrues, numFalses} verified positive) —
covers all-TRUE and all-FALSE.
`PessimisticStatCombiner.add` AND-clears the flag across multi-branch
combinations:
result.setConst(result.isConst() && stat.isConst());
This directly closes the gotcha you flagged on April 2: *"if in the CASE
statement you have one branch with unknown NDV and another that is the NULL
constant the result of the combiner will appear to
other consumers as a NULL constant"*. With the AND-clear, an unknown
branch combined with a NULL-literal branch produces a result with `isConst =
false`, so downstream consumers see "unknown" rather
than "verified const-NULL".
The only reader of the flag is `extractNDVGroupingColumns`, which uses one
unified condition for the +1 NULL-bucket adjustment:
if (cs.getNumNulls() > 0 && (ndv > 0 || cs.isConst())) {
ndv = StatsUtils.safeAdd(ndv, 1);
}
This applies the +1 for verified const-NULL keys (yielding the exact `1`
group) and skips for unknown-NDV columns (yielding `0` → caller falls back to
`parentNumRows / 2`).
Impact surface
`ColStatistics` is a plain Java POJO in `org.apache.hadoop.hive.ql.plan` —
no Thrift inheritance, no `Serializable`, no metastore persistence. The field
is purely in-memory query-plan state within a
single optimizer pass. The metastore Thrift schema (`ColumnStatisticsObj`,
the per-type stats data classes) is unchanged; existing persisted column stats
are unaffected; cross-engine consumers (Spark
/ Iceberg / Trino bridges) see the same wire format.
Touched sites in production code:
- `ColStatistics`: new field + getter/setter, propagated in `clone()` and
dumped in `toString()` — the same housekeeping every other boolean field on
`ColStatistics` already carries (mirrors
`isEstimated` / `isFilteredColumn`).
- `StatsUtils`: 3 producer sites where `setConst(true)` is called —
`buildColStatForConstant` (constant projections), and two sub-cases of the
`getColStatistics` BOOLEAN branch (all-NULL + verified
single-value).
- `PessimisticStatCombiner.add`: 1 line — the AND-clear rule.
- `extractNDVGroupingColumns`: 1 read site (the only consumer).
Net flag-aware logic: 4 sites (3 producers + 1 combiner rule) + 1 reader.
Everything else is one-time bookkeeping that any new boolean field on
`ColStatistics` carries.
On your March 25 concern
You wrote on March 25: *"If we add a new field it means that we need to
keep them up to date in various places where ColStatistics are used so I would
prefer to avoid this if possible."* That concern
is correct in principle — and worth weighing against the alternative
below. In practice, with this PR's flag-aware surface bounded to 4 sites + 1
reader, and the default-safe failure mode (a missing
`setConst(true)` call leaves the recovery dormant rather than introducing
wrong answers), the maintenance cost is one-time and finite rather than
open-ended.
A side benefit for #6359 itself
With `isConst` in place, the `StatEstimator` interface refactor in #6359
(adding a `numRows` parameter to `estimate(...)`) becomes unnecessary. Two
motivations for that refactor dissolve:
- The `min(rows, sum NDV)` bound enforcement — your March 25 suggestion:
*"A GenericUDF can never have an NDV > numRows so instead of putting the
responsibility to the implementors of StatEstimator
we could enforce the bound here. This would avoid the changes in the
existing APIs."* Applying the cap at the call site keeps `estimate(parents)`
unchanged.
- The combiner's need to detect const-NULL via signature (which required
parent rows internally to check `numNulls == numRows`) — closed by `isConst`
being explicit in `ColStatistics`. The combiner
reads the flag directly; no signature inference, no parent-row plumbing
needed.
Net effect: #6359 can drop the API change and stay focused on its stated
goal (the combiner improvement) — directly addressing your March 25 *"There is
no reason to add a default implementation for
this method. Leaving things as they are does not break anything."*
Alternative considered
The architecturally complete fix is the one you yourself suggested on
March 26: *"everything would be simpler if we could use -1 for NDV to declare
unknown as it happens for the other stats. This is
probably a bigger change to digest so let's not go into this direction for
now."* I agree it's the right long-term answer; concretely the cost is:
- ~20+ read sites of `ColStatistics.getCountDistint()` across
`optimizer/`, `stats/`, etc. — most check `<= 0` or `== 0` under the implicit
"0 means unknown" assumption and would each need
case-by-case re-evaluation;
- the metastore Thrift schema's per-type stats data (`getNumDVs()`
returning `long` with implicit `0` default) — definitionally tangles "verified
zero" with "unset";
- backward compatibility for already-persisted column stats, where
existing `0` values become ambiguous between "verified zero" and "old/unset";
- cross-engine coordination (Spark / Iceberg / Trino consume Hive's column
stats — semantic shift in NDV=0 is a wire-compatibility break);
- regenerating likely hundreds of `.q.out` goldens that print NDV in
EXPLAIN annotations.
vs `isConst`'s 4 flag-aware sites / 1 reader / no schema or wire change.
I think this revision is the right balance for the cost/benefit ratio, and
on March 26 in the same thread you indicated you'd accept this approach: *"I am
OK to revert to your original proposal of
adding a new boolean field in ColStatistics object for marking constants.
Leaving the final choice to you :)"* Happy to discuss further if any of the
above raises concerns.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]