konstantinb commented on PR #6418:
URL: https://github.com/apache/hive/pull/6418#issuecomment-4375372923
> @konstantinb Thanks for the detailed answer.
>
> NULL is not an actual value so not sure if it should be reflected in the NDV. In other words, maybe we should not adjust NDV at all independent if it is constant or not. I think we have discussed this at some point but don't remember if we concluded somewhere. Have you explored this alternative?
@zabetak — sorry for not following up sooner on [your
question](https://github.com/apache/hive/pull/6359#discussion_r3026760188) in
#6359. You raised the possibility of dropping the +1 for the NDV > 0 case
entirely rather than narrowing it. I wanted to show a concrete case where that
matters before going that route.
The +1 dates to
[HIVE-5369](https://issues.apache.org/jira/browse/HIVE-5369) (Prasanth
Jayachandran, 2013), introduced with the comment "if NULLs exists, add 1 to
distinct count":
```java
long dvs = cs.getCountDistint();
// if NULLs exists, add 1 to distinct count
if (cs.getNumNulls() > 0) {
  dvs += 1;
}
```
The reason is GROUP BY semantics: a nullable column with NDV > 0 produces
one extra output group for NULL rows. Dropping the +1 means that extra group
disappears from the estimate.
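As a quick illustration (table and data hypothetical):

```sql
-- With rows ('mobile'), (NULL), (NULL): device_type has NDV = 1,
-- but GROUP BY emits two groups:
--   mobile   1
--   NULL     2
SELECT device_type, COUNT(*) FROM events GROUP BY device_type;
```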
I added `ql/src/test/queries/clientpositive/groupby_null_bucket.q` to make this concrete. The scenario: two columns each with NDV=1 but a large null fraction (the pattern you get from LEFT JOINs with many unmatched rows), grouped alongside a high-cardinality key (a sketch of the query shape follows the numbers):
| Column | NDV | numNulls |
|---|---:|---:|
| product_id | 100,000 | 0 |
| funnel_step | 1 | 480,000,000 |
| device_type | 1 | 450,000,000 |
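These are the per-column stats as Hive would report them (table/column names illustrative, output abridged):

```sql
DESCRIBE FORMATTED web_events funnel_step;
-- ...
-- num_nulls       480000000
-- distinct_count  1
```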
- **With +1:** NDV product = 100K × 2 × 2 = 400K → 13 MB → above the 10 MB map join threshold → **Merge Join**
- **Without +1:** NDV product = 100K × 1 × 1 = 100K → 3.3 MB → below the threshold → **Map Join** (a 4× underestimate, with OOM risk on the actual ~400K-row result)
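The query shape is roughly this (a sketch only; the exact file is attached at the end, and table/column names are illustrative):

```sql
-- Aggregate on (high-cardinality key, two mostly-NULL columns), then join;
-- the map-vs-merge join choice hinges on the estimated size of agg.
EXPLAIN
SELECT d.product_name, agg.funnel_step, agg.device_type, agg.cnt
FROM (
  SELECT product_id, funnel_step, device_type, COUNT(*) AS cnt
  FROM web_events
  GROUP BY product_id, funnel_step, device_type
) agg
JOIN product_dim d ON agg.product_id = d.product_id;
```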
What would be removed:

```diff
 long ndv = cs.getCountDistint();
-if (cs.getNumNulls() > 0) {
-  ndv = StatsUtils.safeAdd(ndv, 1);
-}
 ndvValues.add(ndv);
```
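For context on why this one line matters: the collected `ndvValues` multiply into the estimated group count (my paraphrase of the stats rule; the full formula also involves caps this diff doesn't touch):

$$\hat{G} \approx \prod_{i \,\in\, \text{grouping keys}} \mathrm{ndv}_i$$

so a +1 on each NDV=1 column doubles the product, compounding across columns; that is exactly the 4× gap above.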
Plan change (key excerpts):

**With +1:**

```
Edges:
  Reducer 2 <- Map 1 (SIMPLE_EDGE)
  Reducer 3 <- Map 4 (SIMPLE_EDGE), Reducer 2 (SIMPLE_EDGE)
...
Group By Operator [mergepartial]
  Statistics: Num rows: 400000 Data size: 13064238
...
Merge Join Operator
```

**Without +1:**

```
Edges:
  Map 3 <- Reducer 2 (BROADCAST_EDGE)
  Reducer 2 <- Map 1 (SIMPLE_EDGE)
...
Group By Operator [mergepartial]
  Statistics: Num rows: 100000 Data size: 3266238
...
Map Join Operator
```
[groupby_null_bucket.q.txt](https://github.com/user-attachments/files/27379793/groupby_null_bucket.q.txt)
[groupby_null_bucket.q.out.WITHOUT_PLUS1.txt](https://github.com/user-attachments/files/27379804/groupby_null_bucket.q.out.WITHOUT_PLUS1.txt)
[groupby_null_bucket.q.out.WITH_PLUS1.txt](https://github.com/user-attachments/files/27379805/groupby_null_bucket.q.out.WITH_PLUS1.txt)