asolimando opened a new pull request, #21122:
URL: https://github.com/apache/datafusion/pull/21122

   ## Which issue does this PR close?
   
   Part of #21120 (framework + projection/filter integration)
   
   ## Rationale for this change
   
   DataFusion currently loses expression-level statistics when computing plan 
metadata. Projected expressions that aren't bare columns or literals get 
unknown statistics, and filter selectivity falls back to a hardcoded 20% when 
interval analysis cannot handle the predicate. There is also no extension point 
for users to provide statistics for their own UDFs.
   
   This PR introduces ExpressionAnalyzer, a pluggable chain-of-responsibility 
framework that addresses these gaps. It follows the same extensibility pattern 
used elsewhere in DataFusion (ExprPlanner, OptimizerRule).
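
The chain-of-responsibility idea can be illustrated with a small, self-contained sketch. This is not the PR's code. `AnalysisResult`, `Expr`, `AnalyzerRegistry`, `ColumnAnalyzer`, `BucketUdfAnalyzer`, and the `bucket10` UDF are hypothetical names invented here, and the real trait works on DataFusion physical expressions and covers more statistics than a distinct count. The sketch only shows the two properties the description names: the first analyzer that returns `Computed` wins, and each analyzer receives the registry so it can delegate sub-expressions back to the whole chain.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Result of asking a single analyzer about an expression (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum AnalysisResult<T> {
    /// The analyzer produced an estimate; the chain stops here.
    Computed(T),
    /// The analyzer does not handle this expression; try the next one.
    Delegate,
}

/// Toy expression tree standing in for a physical expression.
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Udf { name: String, arg: Box<Expr> },
}

/// Simplified analyzer trait: estimates only the number of distinct values.
pub trait ExpressionAnalyzer: Send + Sync {
    fn distinct_count(&self, expr: &Expr, registry: &AnalyzerRegistry) -> AnalysisResult<u64>;
}

/// Ordered chain of analyzers.
pub struct AnalyzerRegistry {
    analyzers: Vec<Arc<dyn ExpressionAnalyzer>>,
}

impl AnalyzerRegistry {
    pub fn new() -> Self {
        Self { analyzers: Vec::new() }
    }

    /// Appends an analyzer; earlier analyzers take precedence.
    pub fn with_analyzer(mut self, analyzer: Arc<dyn ExpressionAnalyzer>) -> Self {
        self.analyzers.push(analyzer);
        self
    }

    /// Returns the first `Computed` answer, or `None` if every analyzer delegates.
    pub fn distinct_count(&self, expr: &Expr) -> Option<u64> {
        self.analyzers.iter().find_map(|a| match a.distinct_count(expr, self) {
            AnalysisResult::Computed(n) => Some(n),
            AnalysisResult::Delegate => None,
        })
    }
}

/// Stand-in for a default analyzer: answers only for columns with known NDV.
pub struct ColumnAnalyzer {
    pub ndv: HashMap<String, u64>,
}

impl ExpressionAnalyzer for ColumnAnalyzer {
    fn distinct_count(&self, expr: &Expr, _registry: &AnalyzerRegistry) -> AnalysisResult<u64> {
        match expr {
            Expr::Column(name) => match self.ndv.get(name) {
                Some(&n) => AnalysisResult::Computed(n),
                None => AnalysisResult::Delegate,
            },
            _ => AnalysisResult::Delegate,
        }
    }
}

/// User-supplied analyzer for a hypothetical `bucket10` UDF that maps its
/// input into at most 10 buckets.
pub struct BucketUdfAnalyzer;

impl ExpressionAnalyzer for BucketUdfAnalyzer {
    fn distinct_count(&self, expr: &Expr, registry: &AnalyzerRegistry) -> AnalysisResult<u64> {
        match expr {
            Expr::Udf { name, arg } if name == "bucket10" => {
                // Delegate the argument back to the full chain, then cap at 10.
                match registry.distinct_count(arg) {
                    Some(n) => AnalysisResult::Computed(n.min(10)),
                    None => AnalysisResult::Computed(10),
                }
            }
            _ => AnalysisResult::Delegate,
        }
    }
}
```

With `BucketUdfAnalyzer` registered ahead of `ColumnAnalyzer`, a query for `bucket10(user_id)` is answered by the UDF analyzer, which asks the registry for the distinct count of `user_id` and caps it at 10. An expression that no analyzer recognizes yields `None`, which corresponds to unknown statistics.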
   
   Addresses reviewer feedback from #19957: chain delegation, SessionState 
integration, and placing the code in its own folder.
   
   ## What changes are included in this PR?
   
   - `ExpressionAnalyzer` trait with `registry` parameter for chain delegation
   - `ExpressionAnalyzerRegistry` to chain analyzers (first `Computed` wins)
   - `DefaultExpressionAnalyzer`: Selinger-style estimation for columns, 
literals, boolean and comparison expressions (AND/OR/NOT, comparisons), and arithmetic
   - `ExpressionAnalyzerRegistry` stored in `SessionState`, injected into 
`ProjectionExec` and `FilterExec` by the planner
   - `ProjectionExprs` uses registry to estimate NDV, min/max, and null 
fraction through projected expressions
   - `FilterExec` uses registry selectivity as fallback when `check_support` 
returns false
   - Config option `datafusion.optimizer.enable_expression_analyzer` (default 
false) to opt in; no behavior change when disabled
   - Limitation: projections/filters created by optimizer rules after planning 
do not receive the registry and fall back to upstream behavior. Full coverage 
requires an operator-level statistics registry (orthogonal, will be tracked 
separately).
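
To make "Selinger-style estimation" and the filter fallback concrete, here is a minimal sketch of the textbook System R selectivity rules. The PR description does not spell out its formulas, so everything below is an assumption rather than the PR's implementation: `1/NDV` for equality, `1/3` for range comparisons, independence for AND, inclusion-exclusion for OR, and the complement for NOT. `Predicate`, `estimate`, and `filter_selectivity` are hypothetical names, and the sketch models only the fallback path, not interval analysis.

```rust
/// Toy predicate tree (hypothetical; not DataFusion's physical expressions).
#[derive(Debug, Clone)]
pub enum Predicate {
    /// `column = literal`, with the column's distinct count if known.
    Eq { ndv: Option<u64> },
    /// A range comparison such as `column < literal`.
    Range,
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
    /// A predicate the estimator cannot analyze.
    Opaque,
}

/// The hardcoded 20% fallback mentioned in the PR description.
pub const DEFAULT_SELECTIVITY: f64 = 0.2;

/// Classic System R guess for range predicates (an assumption in this sketch).
pub const RANGE_SELECTIVITY: f64 = 1.0 / 3.0;

/// Returns an estimated selectivity in [0, 1], or `None` if any part of the
/// predicate cannot be estimated.
pub fn estimate(p: &Predicate) -> Option<f64> {
    match p {
        // Uniformity assumption: each distinct value is equally likely.
        Predicate::Eq { ndv: Some(n) } if *n > 0 => Some(1.0 / *n as f64),
        Predicate::Eq { .. } => None,
        Predicate::Range => Some(RANGE_SELECTIVITY),
        // Independence assumption: P(a AND b) = P(a) * P(b).
        Predicate::And(l, r) => Some(estimate(l)? * estimate(r)?),
        // Inclusion-exclusion: P(a OR b) = P(a) + P(b) - P(a) * P(b).
        Predicate::Or(l, r) => {
            let (a, b) = (estimate(l)?, estimate(r)?);
            Some(a + b - a * b)
        }
        // Complement: P(NOT a) = 1 - P(a).
        Predicate::Not(inner) => Some(1.0 - estimate(inner)?),
        Predicate::Opaque => None,
    }
}

/// Fallback path only: use the estimate when the feature is enabled and an
/// estimate exists, otherwise keep the default selectivity.
pub fn filter_selectivity(p: &Predicate, analyzer_enabled: bool) -> f64 {
    if analyzer_enabled {
        estimate(p).unwrap_or(DEFAULT_SELECTIVITY)
    } else {
        DEFAULT_SELECTIVITY
    }
}
```

Under these assumptions, `a = 1 AND b = 2` with distinct counts of 10 and 4 is estimated at `0.1 * 0.25 = 0.025`, while a predicate containing an opaque sub-expression keeps the 20% default. With the flag disabled, every predicate takes the default, which mirrors the opt-in behavior described above.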
   
   ## Are these changes tested?
   
   - 15 unit tests for ExpressionAnalyzer (NDV, selectivity, min/max, null 
fraction, custom analyzers, chain delegation)
   - 31 projection tests (including new 
`test_project_statistics_with_expression_analyzer`)
   - 26 filter tests
   - 7 session state tests
   
   ## Are there any user-facing changes?
   
   New public API (purely additive, non-breaking):
   - `ExpressionAnalyzer` trait and `ExpressionAnalyzerRegistry` in 
`datafusion-physical-expr`
   - `SessionState::expression_analyzer_registry()` getter
   - `SessionStateBuilder::with_expression_analyzer_registry()` setter
   - `ProjectionExprs::with_expression_analyzer_registry()` setter
   - `FilterExecBuilder::with_expression_analyzer_registry()` setter
   - `ProjectionExec::with_expression_analyzer_registry()` setter
   - Config option `datafusion.optimizer.enable_expression_analyzer`
   
   No breaking changes. Default behavior is unchanged (config defaults to 
false).
   
   ---
   
   Disclaimer: I used AI to assist with code generation. I have manually 
reviewed the output, and it matches my intention and understanding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

