konstantinb commented on PR #6418:
URL: https://github.com/apache/hive/pull/6418#issuecomment-4361803611

   @zabetak — picking up on our earlier discussion in #6359, I'd like to walk 
through what this revision actually does and how it lines up with the concerns 
you raised there. The PR's primary subject
     (the `extractNDVGroupingColumns` over-confident `+1` for unknown NDV) is 
fixed in a way that resolves the const-NULL edge case you flagged, without an 
architectural refactor.
   
     What was done
   
     A single in-memory boolean field `isConst` is added to `ColStatistics`, 
set at exactly the sites that produce a *verifiably*-zero NDV:
   
       1. `StatsUtils.buildColStatForConstant` — every constant projection 
(NULL or non-NULL literal).
       2. `StatsUtils.getColStatistics` BOOLEAN branch, all-NULL sub-case 
(`numTrues == 0 && numFalses == 0`).
       3. `StatsUtils.getColStatistics` BOOLEAN branch, verified single-value 
sub-case (numNulls == 0 with one of {numTrues, numFalses} verified positive) — 
covers all-TRUE and all-FALSE.
   
     `PessimisticStatCombiner.add` AND-clears the flag across multi-branch 
combinations:
   
         result.setConst(result.isConst() && stat.isConst());
   
     This directly closes the gotcha you flagged on April 2: *"if in the CASE 
statement you have one branch with unknown NDV and another that is the NULL 
constant the result of the combiner will appear to
      other consumers as a NULL constant"*. With the AND-clear, an unknown 
branch combined with a NULL-literal branch produces a result with `isConst = 
false`, so downstream consumers see "unknown" rather
      than "verified const-NULL".
   
     The only reader of the flag is `extractNDVGroupingColumns`, which uses one 
unified condition for the +1 NULL-bucket adjustment:
   
         if (cs.getNumNulls() > 0 && (ndv > 0 || cs.isConst())) {
           ndv = StatsUtils.safeAdd(ndv, 1);
         }
   
     This applies the +1 for verified const-NULL keys (yielding the exact `1` 
group) and skips for unknown-NDV columns (yielding `0` → caller falls back to 
`parentNumRows / 2`).
   
     Impact surface
   
     `ColStatistics` is a plain Java POJO in `org.apache.hadoop.hive.ql.plan` — 
no Thrift inheritance, no `Serializable`, no metastore persistence. The field 
is purely in-memory query-plan state within a
     single optimizer pass. The metastore Thrift schema (`ColumnStatisticsObj`, 
the per-type stats data classes) is unchanged; existing persisted column stats 
are unaffected; cross-engine consumers (Spark
      / Iceberg / Trino bridges) see the same wire format.
   
     Touched sites in production code:
     - `ColStatistics`: new field + getter/setter, propagated in `clone()` and 
dumped in `toString()` — the same housekeeping every other boolean field on 
`ColStatistics` already carries (mirrors
     `isEstimated` / `isFilteredColumn`).
     - `StatsUtils`: 3 producer sites where `setConst(true)` is called — 
`buildColStatForConstant` (constant projections), and two sub-cases of the 
`getColStatistics` BOOLEAN branch (all-NULL + verified
     single-value).
     - `PessimisticStatCombiner.add`: 1 line — the AND-clear rule.
     - `extractNDVGroupingColumns`: 1 read site (the only consumer).
   
     Net flag-aware logic: 4 sites (3 producers + 1 combiner rule) + 1 reader. 
Everything else is one-time bookkeeping that any new boolean field on 
`ColStatistics` carries.
   
     On your March 25 concern
   
     You wrote on March 25: *"If we add a new field it means that we need to 
keep them up to date in various places where ColStatistics are used so I would 
prefer to avoid this if possible."* That concern
      is correct in principle — and worth weighing against the alternative 
below. In practice, with this PR's flag-aware surface bounded to 4 sites + 1 
reader, and the default-safe failure mode (a missing
      `setConst(true)` call leaves the recovery dormant rather than introducing 
wrong answers), the maintenance cost is one-time and finite rather than 
open-ended.
   
     A side benefit for #6359 itself
   
     With `isConst` in place, the `StatEstimator` interface refactor in #6359 
(adding a `numRows` parameter to `estimate(...)`) becomes unnecessary. Two 
motivations for that refactor dissolve:
   
     - The `min(rows, sum NDV)` bound enforcement — your March 25 suggestion: 
*"A GenericUDF can never have an NDV > numRows so instead of putting the 
responsibility to the implementors of StatEstimator
     we could enforce the bound here. This would avoid the changes in the 
existing APIs."* Applying the cap at the call site keeps `estimate(parents)` 
unchanged.
     - The combiner's need to detect const-NULL via signature (which required 
parent rows internally to check `numNulls == numRows`) — closed by `isConst` 
being explicit in `ColStatistics`. The combiner
     reads the flag directly; no signature inference, no parent-row plumbing 
needed.
   
     Net effect: #6359 can drop the API change and stay focused on its stated 
goal (the combiner improvement) — directly addressing your March 25 *"There is 
no reason to add a default implementation for
     this method. Leaving things as they are does not break anything."*
   
     Alternative considered
   
     The architecturally complete fix is the one you yourself suggested on 
March 26: *"everything would be simpler if we could use -1 for NDV to declare 
unknown as it happens for the other stats. This is
     probably a bigger change to digest so let's not go into this direction for 
now."* I agree it's the right long-term answer; concretely the cost is:
   
     - ~20+ read sites of `ColStatistics.getCountDistint()` across 
`optimizer/`, `stats/`, etc. — most check `<= 0` or `== 0` under the implicit 
"0 means unknown" assumption and would each need
     case-by-case re-evaluation;
     - the metastore Thrift schema's per-type stats data (`getNumDVs()` 
returning `long` with implicit `0` default) — definitionally tangles "verified 
zero" with "unset";
     - backward compatibility for already-persisted column stats, where 
existing `0` values become ambiguous between "verified zero" and "old/unset";
     - cross-engine coordination (Spark / Iceberg / Trino consume Hive's column 
stats — semantic shift in NDV=0 is a wire-compatibility break);
     - regenerating likely hundreds of `.q.out` goldens that print NDV in 
EXPLAIN annotations.
   
     vs `isConst`'s 4 flag-aware sites / 1 reader / no schema or wire change.
   
     I think this revision is the right balance for the cost/benefit ratio, and 
on March 26 in the same thread you indicated you'd accept this approach: *"I am 
OK to revert to your original proposal of
     adding a new boolean field in ColStatistics object for marking constants. 
Leaving the final choice to you :)"* Happy to discuss further if any of the 
above raises concerns.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to