konstantinb commented on PR #6356:
URL: https://github.com/apache/hive/pull/6356#issuecomment-4426060400
Hi @zabetak, this is a companion to #6418, addressing the same NDV=0
"unknown stats" problem, but in the join cardinality estimator rather than
GROUP BY.
The bug: when join keys have NDV=0 on both sides (common for binary
columns, date/timestamp columns without populated NDV, or tables with stale
stats), `getDenominator` returns 0, which `computeRowCountAssumingInnerJoin`
replaces with 1. The join formula then becomes:
result = otherSideRows × (maxRows / 1)
For two 100M-row tables, that's 100M × 100M = 10^16, a full Cartesian
product estimate for an equi-join. This cascades into downstream operators
(aggregations, subsequent joins) and typically forces suboptimal plans by
making the join output appear astronomically larger than it actually is.
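To make the arithmetic concrete, here is a simplified Java sketch of the behavior described above. It is not the actual Hive code: the class and method names are hypothetical, and it only models the "denominator of 0 is replaced with 1" step and the formula quoted above.

```java
// Simplified model of the estimate described above; NOT Hive's actual code.
// All names here are hypothetical and chosen for illustration only.
class NdvZeroJoinEstimate {

    // Models the described behavior: a denominator of 0 (unknown NDV)
    // is replaced with 1 before it is used in the division.
    static double effectiveDenominator(long denominator) {
        return denominator == 0 ? 1.0 : (double) denominator;
    }

    // result = otherSideRows * (maxRows / denominator)
    static double estimateRows(long otherSideRows, long maxRows, long denominator) {
        return otherSideRows * (maxRows / effectiveDenominator(denominator));
    }

    public static void main(String[] args) {
        long rows = 100_000_000L; // two 100M-row inputs

        // NDV=0 on both sides: the denominator collapses to 1.
        System.out.printf("NDV unknown: %.3e rows%n", estimateRows(rows, rows, 0));

        // For comparison, a populated NDV equal to the row count.
        System.out.printf("NDV = 100M:  %.3e rows%n", estimateRows(rows, rows, rows));
    }
}
```

With a denominator of 0 the sketch returns the full product of the two row counts, while a populated NDV brings the estimate back down to the size of one input.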
The fix intercepts after PK-FK inference fails but before the NDV-based
denominator path, and applies the existing `hive.stats.join.factor` heuristic
(default 1.1× the largest input). This is the same conservative estimate
already used in the "no column statistics at all" branch, just triggered
earlier when we can detect that NDV=0 makes the denominator meaningless.
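As a rough illustration of the ordering, the fallback can be sketched like this. This is a minimal model under stated assumptions, not the code in the PR: the names are hypothetical, PK-FK inference is omitted, and the NDV-based branch uses the textbook `leftRows × rightRows / max(NDV)` form as a stand-in for the real denominator logic.

```java
// Minimal sketch of the fallback ordering described above; NOT the PR's code.
// Names are hypothetical; PK-FK inference is assumed to have already failed.
class NdvZeroFallbackSketch {

    // Default of hive.stats.join.factor as stated in this comment.
    static final double DEFAULT_JOIN_FACTOR = 1.1;

    static long estimateRows(long leftRows, long rightRows,
                             long leftNdv, long rightNdv,
                             double joinFactor) {
        long maxRows = Math.max(leftRows, rightRows);

        // Fallback: NDV=0 on both sides makes an NDV-based denominator
        // meaningless, so use joinFactor x the largest input instead.
        if (leftNdv == 0 && rightNdv == 0) {
            return Math.round(joinFactor * maxRows);
        }

        // Assumed stand-in for the NDV-based path (textbook equi-join estimate).
        double denominator = Math.max(leftNdv, rightNdv);
        return Math.round(((double) leftRows * rightRows) / denominator);
    }

    public static void main(String[] args) {
        long rows = 100_000_000L;
        System.out.println("NDV unknown: "
                + estimateRows(rows, rows, 0, 0, DEFAULT_JOIN_FACTOR));
        System.out.println("NDV = 100M:  "
                + estimateRows(rows, rows, rows, rows, DEFAULT_JOIN_FACTOR));
    }
}
```

For the two 100M-row tables above, the fallback branch yields roughly 110M rows instead of 10^16.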
Would you be willing to take a look when you have time? Happy to provide
additional context or adjust the approach. Thanks!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.