2010YOUY01 commented on PR #19609:
URL: https://github.com/apache/datafusion/pull/19609#issuecomment-3727206400

   Let's work through the API design first. After we agree to move forward, 
I'll apply all minor code review suggestions.
   
   > My suggested next step is to Decide if we have the effort / motivation to 
see it through (I am willing to help review, organized, and build consensus, as 
I think it is a foundational piece of DataFusion)
   
   I do plan to spend significant effort on the future implementation, as I 
believe this is a very important optimization feature. Thank you for your help, 
@alamb!
   
   > I have a bunch of API suggestions I think we can iterate on and make the 
code better / cleaner.
   > 
   > The biggest challenge in this project, in my mind, is that it proposes to 
adds (yet) another API for some sort of range/statistics propagation.
   > 
   > Even before this PR, we already have 4 APIs on the PhysicalExpr
   > 
   > * 
https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds
   > * 
https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_statistics
   > * 
https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.propagate_constraints
   > * 
https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.propagate_statistics
   > 
   > Some of these methods seem to be unused in the core codebase (e.g. the 
"propagate" variants, and some of the new V2 statistics API added in #14699 by 
@Fly-Style, but I don't see any uses of it in the code 
(https://github.com/apache/datafusion/blob/998f534aafbf55c83daaa6fd4985ba143954b0e0/datafusion/physical-expr/src/statistics/stats_solver.rs#L39-L38).
 It also has provisions for various statistical distributions for which I still 
don't understand the usecase
   > 
   > If we are going to add a new API, I think we should deprecate some/all of 
the others.
   
   I took a quick look at those APIs; their high-level ideas are:
   
   - `evaluate_bounds()` propagates bounds using the `Interval` type.
   - `propagate_constraints()` performs inverse propagation on `Interval`s. For 
example, given the output range of an expression and the original input ranges 
for its child expressions, it performs inverse propagation to refine the 
children’s ranges (its comment includes a specific example).
   - `evaluate_statistics()` is similar to `evaluate_bounds()`, but operates on 
`Distribution` inputs to capture richer statistical distribution information.
   - `propagate_statistics()` is similar to `propagate_constraints()`, but 
operates on `Distribution` inputs.
   
   The API proposed in this PR can replace `evaluate_bounds()`, because it 
covers the full functionality of `evaluate_bounds()`. The only difference is 
that it is vectorized; we can use a simple adaptor for conversion, and 
`evaluate_bounds()` can be deprecated in the future.
   
   `propagate_constraints()` cannot be replaced, since the statistics 
propagation for pruning (this PR) does not require inverse propagation.
   
   `evaluate_statistics()` and `propagate_statistics()` cannot be replaced by 
this PR, because their representations do not overlap. Even more advanced 
pruning will not need the statistical information inside the `Distribution` 
struct.
   
   I did some archaeology, but I am not very sure. If `evaluate_statistics()` 
and `propagate_statistics()` are intended to replace `evaluate_bounds()` and 
`propagate_constraints()`, and the latter two are no longer used inside the 
DataFusion core, should we simply deprecate them?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to