konstantinb commented on PR #6356:
URL: https://github.com/apache/hive/pull/6356#issuecomment-4426060400

Hi @zabetak, this is a companion to #6418. It addresses the same NDV=0
"unknown stats" problem, but in the join cardinality estimator rather than
GROUP BY.
   
The bug: when join keys have NDV=0 on both sides (common for binary columns,
date/timestamp columns without populated NDV, or tables with stale stats),
`getDenominator` returns 0, which `computeRowCountAssumingInnerJoin` replaces
with 1. The join formula then becomes:
   
         result = otherSideRows × (maxRows / 1)
   
For two 100M-row tables, that's 100M × 100M = 10^16, a full cartesian product
estimate for an equi-join. This cascades into downstream operators
(aggregations, subsequent joins) and typically forces suboptimal plans by
making the join output appear astronomically larger than it actually is.
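
To make the arithmetic above concrete, here is a minimal sketch of the quoted
formula. It is not the Hive source: the class and method names are
hypothetical, and the only behavior modeled is "a zero denominator is replaced
with 1", as described above. The second call uses an assumed NDV of 100M
purely for comparison.

```java
public class NdvZeroBlowUp {
    // Hypothetical stand-in for the quoted formula:
    //   result = otherSideRows * (maxRows / denominator)
    // with a zero denominator replaced by 1, as the comment describes.
    static double estimateJoinRows(double otherSideRows, double maxRows, double denominator) {
        double safeDenominator = (denominator == 0) ? 1 : denominator;
        return otherSideRows * (maxRows / safeDenominator);
    }

    public static void main(String[] args) {
        double rows = 100_000_000d; // 100M rows on each side

        // NDV=0 on both sides: the denominator is 0 and gets clamped to 1.
        System.out.println(estimateJoinRows(rows, rows, 0));    // prints 1.0E16

        // Assumed for illustration: a populated NDV of 100M (unique join keys).
        System.out.println(estimateJoinRows(rows, rows, rows)); // prints 1.0E8
    }
}
```

The two results differ by eight orders of magnitude, which is the gap that
propagates into downstream operators.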
   
The fix intercepts after PK-FK inference fails but before the NDV-based
denominator path, and applies the existing `hive.stats.join.factor` heuristic
(default 1.1× the largest input). This is the same conservative estimate
already used in the "no column statistics at all" branch, just triggered
earlier when we can detect that NDV=0 makes the denominator meaningless.
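
A rough sketch of the control flow described above, under stated assumptions:
the class, method, and parameter names are hypothetical; PK-FK inference is
assumed to have already failed; and the NDV-based path is simplified to a
single join key whose denominator is the larger of the two NDVs. The actual
change in the PR may be structured differently.

```java
public class NdvZeroFallbackSketch {
    // Default value of hive.stats.join.factor, per the description above.
    static final double JOIN_FACTOR = 1.1;

    // Hypothetical single-key estimator; PK-FK inference is assumed to have failed already.
    static double estimateJoinRows(double leftRows, double rightRows, long leftNdv, long rightNdv) {
        double maxRows = Math.max(leftRows, rightRows);

        // NDV=0 on both sides means "unknown", so the denominator carries no information.
        // Fall back to the join-factor heuristic: factor times the largest input.
        if (leftNdv == 0 && rightNdv == 0) {
            return JOIN_FACTOR * maxRows;
        }

        // Simplified NDV-based path (assumption): denominator is the larger NDV.
        double denominator = Math.max(leftNdv, rightNdv);
        double otherSideRows = Math.min(leftRows, rightRows);
        return otherSideRows * (maxRows / denominator);
    }

    public static void main(String[] args) {
        // Two 100M-row tables with unknown NDV: roughly 1.1e8 instead of 1e16.
        System.out.println(estimateJoinRows(1e8, 1e8, 0, 0));
        // Known NDVs still go through the denominator path.
        System.out.println(estimateJoinRows(1000, 100, 50, 10));
    }
}
```

With this shape, the unknown-stats case is bounded by the largest input rather
than by the product of both inputs.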
   
Would you be willing to take a look when you have time? Happy to provide
additional context or adjust the approach. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

