2010YOUY01 commented on PR #19609: URL: https://github.com/apache/datafusion/pull/19609#issuecomment-3727206400
Let's work through the API design first. After we agree to move forward, I'll apply all minor code review suggestions. > My suggested next step is to Decide if we have the effort / motivation to see it through (I am willing to help review, organized, and build consensus, as I think it is a foundational piece of DataFusion) I do plan to spend significant effort on the future implementation, as I believe this is a very important optimization feature. Thank you for your help, @alamb! > I have a bunch of API suggestions I think we can iterate on and make the code better / cleaner. > > The biggest challenge in this project, in my mind, is that it proposes to adds (yet) another API for some sort of range/statistics propagation. > > Even before this PR, we already have 4 APIs on the PhysicalExpr > > * https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds > * https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_statistics > * https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.propagate_constraints > * https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.propagate_statistics > > Some of these methods seem to be unused in the core codebase (e.g. the "propagate" variants, and some of the new V2 statistics API added in #14699 by @Fly-Style, but I don't see any uses of it in the code (https://github.com/apache/datafusion/blob/998f534aafbf55c83daaa6fd4985ba143954b0e0/datafusion/physical-expr/src/statistics/stats_solver.rs#L39-L38). It also has provisions for various statistical distributions for which I still don't understand the usecase > > If we are going to add a new API, I think we should deprecate some/all of the others. I took a quick look at those APIs; their high-level ideas are: - `evaluate_bounds()` propagates bounds using the `Interval` type. - `propagate_constraints()` performs inverse propagation on `Interval`s. For example, given the output range of an expression and the original input ranges for its child expressions, it performs inverse propagation to refine the children’s ranges (its comment includes a specific example). - `evaluate_statistics()` is similar to `evaluate_bounds()`, but operates on `Distribution` inputs to capture richer statistical distribution information. - `propagate_statistics()` is similar to `propagate_constraints()`, but operates on `Distribution` inputs. The API proposed in this PR can replace `evaluate_bounds()`, because it covers the full functionality of `evaluate_bounds()`. The only difference is that it is vectorized; we can use a simple adaptor for conversion, and `evaluate_bounds()` can be deprecated in the future. `propagate_constraints()` cannot be replaced, since the statistics propagation for pruning (this PR) does not require inverse propagation. `evaluate_statistics()` and `propagate_statistics()` cannot be replaced by this PR, because their representations do not overlap. Even more advanced pruning will not need the statistical information inside the `Distribution` struct. I did some archaeology, but I am not very sure. If `evaluate_statistics()` and `propagate_statistics()` are intended to replace `evaluate_bounds()` and `propagate_constraints()`, and the latter two are no longer used inside the DataFusion core, should we simply deprecate them? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
