asolimando opened a new pull request, #21122: URL: https://github.com/apache/datafusion/pull/21122
## Which issue does this PR close?

Part of #21120 (framework + projection/filter integration)

## Rationale for this change

DataFusion currently loses expression-level statistics when computing plan metadata. Projected expressions that aren't bare columns or literals get unknown statistics, and filter selectivity falls back to a hardcoded 20% when interval analysis cannot handle the predicate. There is also no extension point for users to provide statistics for their own UDFs.

This PR introduces ExpressionAnalyzer, a pluggable chain-of-responsibility framework that addresses these gaps. It follows the same extensibility pattern used elsewhere in DataFusion (ExprPlanner, OptimizerRule).

Addresses reviewer feedback from #19957: chain delegation, SessionState integration, own folder.

## What changes are included in this PR?

- `ExpressionAnalyzer` trait with `registry` parameter for chain delegation
- `ExpressionAnalyzerRegistry` to chain analyzers (first `Computed` wins)
- `DefaultExpressionAnalyzer`: Selinger-style estimation for columns, literals, binary expressions (AND/OR/NOT/comparisons), arithmetic
- `ExpressionAnalyzerRegistry` stored in `SessionState`, injected into `ProjectionExec` and `FilterExec` by the planner
- `ProjectionExprs` uses registry to estimate NDV, min/max, and null fraction through projected expressions
- `FilterExec` uses registry selectivity as fallback when `check_support` returns false
- Config option `optimizer.enable_expression_analyzer` (default false) to opt in; zero behavior change when disabled
- Limitation: projections/filters created by optimizer rules after planning do not receive the registry and fall back to upstream behavior. Full coverage requires an operator-level statistics registry (orthogonal, will be tracked separately).

## Are these changes tested?
- 15 unit tests for ExpressionAnalyzer (NDV, selectivity, min/max, null fraction, custom analyzers, chain delegation)
- 31 projection tests (including new `test_project_statistics_with_expression_analyzer`)
- 26 filter tests
- 7 session state tests

## Are there any user-facing changes?

New public API (purely additive, non-breaking):

- `ExpressionAnalyzer` trait and `ExpressionAnalyzerRegistry` in `datafusion-physical-expr`
- `SessionState::expression_analyzer_registry()` getter
- `SessionStateBuilder::with_expression_analyzer_registry()` setter
- `ProjectionExprs::with_expression_analyzer_registry()` setter
- `FilterExecBuilder::with_expression_analyzer_registry()` setter
- `ProjectionExec::with_expression_analyzer_registry()` setter
- Config option `datafusion.optimizer.enable_expression_analyzer`

No breaking changes. Default behavior is unchanged (config defaults to false).

---

Disclaimer: I used AI to assist in the code generation; I have manually reviewed the output, and it matches my intention and understanding.
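
For readers unfamiliar with the pattern, the chain delegation ("first `Computed` wins") and the Selinger-style combination of AND/OR/NOT selectivities described under "What changes are included in this PR?" can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `Expr`, `AnalysisResult`, `Registry`, `WeekendAnalyzer`, `DefaultAnalyzer`, `example_registry`, and the `is_weekend` predicate with its 0.25 selectivity are all hypothetical stand-ins, and the real trait works on DataFusion's physical expressions and richer statistics rather than a single `f64`.

```rust
// Hypothetical, simplified types: NOT the actual API introduced by this PR.

/// Toy predicate tree standing in for a physical expression.
enum Expr {
    /// Opaque predicate identified by name (for example, a UDF call).
    Pred(&'static str),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
    Not(Box<Expr>),
}

/// Answer from a single analyzer: an estimate, or "ask the next one".
enum AnalysisResult {
    Computed(f64),
    Delegate,
}

trait ExpressionAnalyzer {
    /// The registry is passed in so an analyzer can estimate
    /// sub-expressions through the whole chain, not only through itself.
    fn selectivity(&self, expr: &Expr, registry: &Registry) -> AnalysisResult;
}

struct Registry {
    analyzers: Vec<Box<dyn ExpressionAnalyzer>>,
}

impl Registry {
    /// Walks the chain in order; the first `Computed` wins.
    /// Returns `None` when every analyzer delegates.
    fn selectivity(&self, expr: &Expr) -> Option<f64> {
        self.analyzers.iter().find_map(|a| match a.selectivity(expr, self) {
            AnalysisResult::Computed(s) => Some(s),
            AnalysisResult::Delegate => None,
        })
    }
}

/// Textbook combination rules that assume independent predicates.
struct DefaultAnalyzer;

impl ExpressionAnalyzer for DefaultAnalyzer {
    fn selectivity(&self, expr: &Expr, registry: &Registry) -> AnalysisResult {
        let estimate = match expr {
            // P(a AND b) = P(a) * P(b)
            Expr::And(l, r) => registry
                .selectivity(l)
                .zip(registry.selectivity(r))
                .map(|(a, b)| a * b),
            // P(a OR b) = P(a) + P(b) - P(a) * P(b)
            Expr::Or(l, r) => registry
                .selectivity(l)
                .zip(registry.selectivity(r))
                .map(|(a, b)| a + b - a * b),
            // P(NOT a) = 1 - P(a)
            Expr::Not(e) => registry.selectivity(e).map(|s| 1.0 - s),
            // Nothing is known about opaque predicates here.
            Expr::Pred(_) => None,
        };
        match estimate {
            Some(s) => AnalysisResult::Computed(s),
            None => AnalysisResult::Delegate,
        }
    }
}

/// User-supplied analyzer that only knows one (made-up) predicate.
struct WeekendAnalyzer;

impl ExpressionAnalyzer for WeekendAnalyzer {
    fn selectivity(&self, expr: &Expr, _registry: &Registry) -> AnalysisResult {
        match expr {
            Expr::Pred(name) if *name == "is_weekend" => AnalysisResult::Computed(0.25),
            _ => AnalysisResult::Delegate,
        }
    }
}

/// Custom analyzer first, default rules last.
fn example_registry() -> Registry {
    Registry {
        analyzers: vec![Box::new(WeekendAnalyzer), Box::new(DefaultAnalyzer)],
    }
}
```

In this sketch, `is_weekend AND NOT is_weekend` is estimated as 0.25 × 0.75 = 0.1875 because the default analyzer recurses through the registry and the custom analyzer answers for the leaves. A predicate that no analyzer recognizes yields `None`, which a caller could map to a fallback such as the hardcoded 20% mentioned in the rationale.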
