asolimando opened a new issue, #20966:
URL: https://github.com/apache/datafusion/issues/20966

   ### Is your feature request related to a problem or challenge?
   
   The overlap-based NDV merge formula (`estimate_ndv_with_overlap`) introduced 
in #19957 (and shared with Union from #20846) is not associative: merging 
`(A+B)+C` produces a different result than `A+(B+C)` because the intermediate 
merge updates min/max/NDV, which "smears" the column statistics before the next 
pairwise merge.
   
   Example:
   ```
   A = [0,100], NDV=80
   B = [50,150], NDV=60
   C = [100,200], NDV=50
   
   (A+B)+C = 135
   A+(B+C) = 137
   ```
   
   Current usage in `try_merge`, as addressed in #19957, folds left-to-right 
over row groups, so the result depends on row group ordering. Practically, row 
group order within a Parquet file is stable, so results are deterministic for a 
given file. The difference is typically small (bounded by the 
uniform-distribution assumption).
   
   ### Describe the solution you'd like
   
   A multi-way merge that computes the overlap formula over all inputs at once 
rather than pairwise, using the original (unsmeared) min/max/NDV from each 
input.
   
   ### Describe alternatives you've considered
   
   Store HLL sketches per row group/column in Parquet and merge them for exact 
NDV computation, bypassing the scalar heuristic entirely. Parquet does not 
natively support this, but [apache/arrow-rs#8608 
(comment)](https://github.com/apache/arrow-rs/issues/8608#issuecomment-4041324952)
 proposes encoding distinct-count estimates in user-defined key=value metadata, 
which could serve as a vehicle for HLL sketches. Citing as an alternative for 
completeness, not currently viable as not implemented yet.
   
   
   ### Additional context
   
   - Review comment: 
https://github.com/apache/datafusion/pull/19957#discussion_r2936704265 
(@gene-bordegaray)
   - Related PR: #20846 (Union NDV overlap formula)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to