konstantinb commented on PR #6418:
URL: https://github.com/apache/hive/pull/6418#issuecomment-4375372923

   > @konstantinb Thanks for the detailed answer.
   > 
   > NULL is not an actual value so not sure if it should be reflected in the 
NDV. In other words, maybe we should not adjust NDV at all independent if it is 
constant or not. I think we have discussed this at some point but don't 
remember if we concluded somewhere. Have you explored this alternative?
   
    @zabetak — sorry for not following up sooner on [your 
question](https://github.com/apache/hive/pull/6359#discussion_r3026760188) in 
#6359. You raised the possibility of dropping the +1 for the NDV > 0 case 
entirely rather than narrowing it. I wanted to show a concrete case where that 
matters before going that route.
   
     The +1 dates to 
[HIVE-5369](https://issues.apache.org/jira/browse/HIVE-5369) (Prasanth 
Jayachandran, 2013), introduced with the comment "if NULLs exists, add 1 to 
distinct count":
   
     long dvs = cs.getCountDistint();
     // if NULLs exists, add 1 to distinct count
     if (cs.getNumNulls() > 0) {
         dvs += 1;
     }
   
     The reason is GROUP BY semantics: a nullable column with NDV > 0 produces 
one extra output group for NULL rows. Dropping the +1 means that extra group 
disappears from the estimate.
   
     I added ql/src/test/queries/clientpositive/groupby_null_bucket.q to make 
this concrete. The scenario: two columns each with NDV=1 but a large null 
fraction (the pattern you get from LEFT JOINs with many unmatched rows), 
grouped alongside a high-cardinality key:
   
     ┌─────────────┬─────────┬─────────────┐
     │   Column    │   NDV   │  numNulls   │
     ├─────────────┼─────────┼─────────────┤
     │ product_id  │ 100,000 │ 0           │
     ├─────────────┼─────────┼─────────────┤
     │ funnel_step │ 1       │ 480,000,000 │
     ├─────────────┼─────────┼─────────────┤
     │ device_type │ 1       │ 450,000,000 │
     └─────────────┴─────────┴─────────────┘
   
     With +1: NDV product = 100K × 2 × 2 = 400K → 13 MB → above 10 MB map join 
threshold → Merge Join
     Without +1: NDV product = 100K × 1 × 1 = 100K → 3.3 MB → below threshold → 
Map Join (4× underestimate, OOM risk on the actual ~400K-row result)
   
     What would be removed:
   
       long ndv = cs.getCountDistint();
     - if (cs.getNumNulls() > 0) {
     -   ndv = StatsUtils.safeAdd(ndv, 1);
     - }
       ndvValues.add(ndv);
   
     Plan change (key excerpts):
   
     With +1:
     Edges:
       Reducer 2 <- Map 1 (SIMPLE_EDGE)
       Reducer 3 <- Map 4 (SIMPLE_EDGE), Reducer 2 (SIMPLE_EDGE)
     ...
     Group By Operator [mergepartial]
       Statistics: Num rows: 400000 Data size: 13064238
     ...
     Merge Join Operator
   
     Without +1:
     Edges:
       Map 3 <- Reducer 2 (BROADCAST_EDGE)
       Reducer 2 <- Map 1 (SIMPLE_EDGE)
     ...
     Group By Operator [mergepartial]
       Statistics: Num rows: 100000 Data size: 3266238
     ...
     Map Join Operator
     
   
[groupby_null_bucket.q.txt](https://github.com/user-attachments/files/27379793/groupby_null_bucket.q.txt)
   
[groupby_null_bucket.q.out.WITHOUT_PLUS1.txt](https://github.com/user-attachments/files/27379804/groupby_null_bucket.q.out.WITHOUT_PLUS1.txt)
   
[groupby_null_bucket.q.out.WITH_PLUS1.txt](https://github.com/user-attachments/files/27379805/groupby_null_bucket.q.out.WITH_PLUS1.txt)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to