konstantinb commented on PR #6418: URL: https://github.com/apache/hive/pull/6418#issuecomment-4671800984
@zabetak, my apologies for not following up on all your questions/comments of this PR; I was focused on its alternative solution. Taking your suggestion from #discussion_r3233225674 — making the const case `if (cs.isConst()) ndv = 1` — broke 8 .out files this PR never touched: `union7/15/17/19`, `unionDistinct_3`, `tez_union_multiinsert`, `explainanalyze_2`, `load_dyn_part14`. The cause is exactly what you flagged on #6359 when you first saw the field: > Can we avoid this new indicator? If we add a new field it means that we need to keep them up to date in various places where `ColStatistics` are used [...] `isConst` is set at the producers and maintained in the CASE/COALESCE combiner (`PessimisticStatCombiner`), but **not** in the UNION combiner (`Statistics.addToColumnStats`) — that one recomputes `countDistinct` and `numNulls` and leaves `isConst` alone. So a column out of a `UNION` of distinct constants still reports `isConst = true`, and the unconditional `ndv = 1` acts on a stale flag. There's no single place that keeps the flag consistent across every path `ColStatistics` takes — your March objection, made concrete. (Maintaining it in `addToColumnStats` too doesn't rescue this: `result.isConst() && stat.isConst()` — the rule already in the CASE combiner — still classifies a UNION of *distinct* constants as constant, so there's no boolean that's correct here without comparing the underlying values, which `ColStatistics` doesn't carry.) That's also why I'd rather take the rest of these together than one at a time — they're facets of the same `NDV == 0` overload. The boolean-branch ones are locally patchable (and `:832` you rightly flagged as likely unreachable), but a local patch just relocates the ambiguity instead of removing it. The one that can't be patched around is Part B: distinguishing a genuinely-unknown NDV from a verified zero — your funnel_step / device_type case — needs the two states to be *different values*, which a boolean const flag can't supply for a non-constant column. The sentinel separates them once and the rest follow. What does fix them is the other direction from that same #6359 thread: > everything would be simpler if we could use -1 for NDV to declare unknown as it happens for the other stats. This is probably a bigger change to digest so let's not go into this direction for now. That's HIVE-29625 / #6505. It's the larger change you flagged as a lot to digest — ~17 files, more `.q.out` churn — which is why I tried the flag first. But the consistency problem is structural, not a missing case, so I don't think the `isConst` path can be made correct. With the -1 treated as "unknown NDV". there's no flag to sync: `extractNDVGroupingColumns` becomes just `if (ndv < 0) fall back to the heuristic; else +1 for the null bucket`, and your Part B underestimation resolves on its own — unknown columns are `-1` and fall back, genuinely-all-null columns stay `0` and get the `+1`, so both "sources of (NDV=0, numNulls>0)" land right without guessing their frequency. So I'd like to close this PR (and HIVE-29556) as superseded by HIVE-29625 and move review to #6505 — the inevitable version of this rather than one more attempt at the flag, and the direction you pointed to first. Apologies for the detour to get here. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
