2010YOUY01 opened a new issue, #19487: URL: https://github.com/apache/datafusion/issues/19487
### Is your feature request related to a problem or challenge? ### Problem Today, pruning is hard to extend beyond a small set of “built-in” expression patterns. As a result, realistic predicates (nested expressions, functions/UDFs, etc.) are either not prunable or require optimizer-specific special cases that are hard to maintain. ### Proposal Introduce **expression-aware pruning via statistics propagation**: evaluate pruning by propagating available `ColumnStats` through the **physical expression tree**, and letting each expression/function optionally define how it transforms each statistics type. Below it will demonstrate how to prune a micro-partition with a very complex and nested expression easily, using this statistics propagation approach. The idea is inspired from [Pruning in Snowflake: Working Smarter, Not Harder]([https://arxiv.org/abs/2504.11540](https://arxiv.org/abs/2504.11540)) Section 3 outlines only the high-level idea and does not specify an algorithm. The concrete implementation is developed in this write-up and accompanying POC PR, and is looking for feedbacks. I think this approach can also lead to clean solutions to various pruning related requirements including https://github.com/apache/datafusion/issues/19028 https://github.com/apache/datafusion/issues/18320 https://github.com/apache/datafusion/issues/1692 ### Terms * **Micro-partition**: the smallest unit of storage-level pruning (e.g., partitioned file/ row groups/ pages in parquet). They are prunable because the have partition level statistics the pruning engine can use to evaluate. * **Zone map / range stats**: per micro-partition `min/max` (and maybe more complex stats) for a column. * **Null stats**: summaries like `null_count`, `row_count`, or derived states like `AllNull`, `AllNotNull`, `Unknown`. * **Set stats**: a summary of distinct values (exact set or bounded dictionary) inside a micro-partition, e.g., `{'toy', 'clothing'}`. * **Geo stats**: spatial summaries such as an MBR (minimum bounding rectangle) containing all geometries in a micro-partition. * **Pruning result (tri-state)**: * **SkipAll**: predicate is always false on the micro-partition → safe to prune. * **KeepAll**: predicate is always true on the micro-partition → scan can skip evaluating the predicate. * **Unknown**: predicate is mixed → must scan and evaluate the predicate. * **Statistics propagation**: evaluating an expression by transforming and combining statistics bottom-up along the expression tree. ## Example walkthrough ### Background: micro-partition pruning using column statistics Suppose we have a filter predicate `x > 5`, i.e.: ```sql SELECT * FROM t WHERE x > 5 ``` Assume the table is stored with micro-partition zone maps: * partition 1: `min=0, max=3` * partition 2: `min=2, max=12` * partition 3: `min=8, max=20` Then for `x > 5`: * `x > 5` on `[0,3]` → **SkipAll** (always false) * `x > 5` on `[2,12]` → **Unknown** (mixed) * `x > 5` on `[8,20]` → **KeepAll** (always true) This demonstrates why tri-state matters: **KeepAll** lets us skip predicate evaluation during scan when it is guaranteed true. ### Issue: making it work on real workloads is hard Realistic filter predicates are nested and include functions/UDFs. Column statistics can also be diverse (min/max, null summaries, sets/dictionaries, geospatial summaries, etc.). Implementing a pruning framework that composes these cleanly is hard. Consider this workload: ```sql -- TODO: describe table schema SELECT * FROM t WHERE ((a * 7) > 100) OR (b IS NULL) OR (UPPER(c) IN ('ELECTRONIC', 'BOOK')) OR (ST_INTERSECTS(d, MBR_EUROPE)) ``` And assume we have a micro-partition with rich statistics: ```text a: min=0, max=10 b: AllNotNull c: all values within set {'toy', 'clothing'} d: all geometries are within MBR_CHINA ``` If we evaluate the predicate using all these statistics, this micro-partition can be safely skipped: * Sub-expr 1: `(a * 7) > 100` * `a` in `[0,10]` ⇒ `a * 7` in `[0,70]` ⇒ always false ⇒ **SkipAll** * Sub-expr 2: `b IS NULL` * `b` is `AllNotNull` ⇒ always false ⇒ **SkipAll** * Sub-expr 3: `UPPER(c) IN ('ELECTRONIC', 'BOOK')` * `c` in `{'toy','clothing'}` ⇒ `UPPER(c)` in `{'TOY','CLOTHING'}` ⇒ always false ⇒ **SkipAll** * Sub-expr 4: `ST_INTERSECTS(d, MBR_EUROPE)` * `d` within `MBR_CHINA` ⇒ if `MBR_CHINA ∩ MBR_EUROPE = ∅`, then always false ⇒ **SkipAll** For `OR`, if every branch is **SkipAll**, then the whole predicate is **SkipAll**, so the micro-partition can be pruned. Here are some figures to walk through the Stats/PruningResult propagation steps, the geospatial sub-expression is ignored. <img width="1174" height="806" alt="Image" src="https://github.com/user-attachments/assets/6505ba1e-c164-422d-bf93-c005c20d3b1c" /> *Figure 1. Stats Propagation for sub-expression `(a*7)>100`* <img width="544" height="488" alt="Image" src="https://github.com/user-attachments/assets/c10329a4-6ebe-4664-999b-4afdaf261261" /> *Figure 2. Stats Propagation for sub-expression `b IS NULL`* <img width="1270" height="535" alt="Image" src="https://github.com/user-attachments/assets/d5e89d64-7046-4f4c-8091-76bd37774357" /> *Figure 3. Stats Propagation for sub-expression `UPPER(c) in ('ELECTRONIC', 'BOOK')`* <img width="1585" height="845" alt="Image" src="https://github.com/user-attachments/assets/cbce6ae9-3248-47ab-803d-e4381ec5021b" /> *Figure 4. Stats Propagation for final evaluation on `OR`s* ## Current Implementation State `PruningPredicate` is the main place to implement pruning. It supports basic range and null-summary arithmetic for a subset of expressions, but it effectively supports only two outcomes: `SkipAll` or `Other`. Code: [https://github.com/apache/datafusion/blob/6ce237492d9f75477c594ba132b2575932122dd6/datafusion/pruning/src/pruning_predicate.rs#L51](https://github.com/apache/datafusion/blob/6ce237492d9f75477c594ba132b2575932122dd6/datafusion/pruning/src/pruning_predicate.rs#L51) The idea is converting the raw predicate, to another predicate that involves partition min/max and other stats, and can be directly evaluated to a boolean value that decides whether to prune. Potential issues: * **No tri-state output.** It’s missing **KeepAll**, which can be used to skip redundant filtering. * Example: predicate `x > 0` with stats `min=10, max=20` should be **KeepAll** (scan, but don’t evaluate the predicate). * **Hard to extend for new functions/UDFs.** Implementing interval arithmetic or pruning semantics for not-yet-supported expressions often requires deep changes inside the optimizer. As a workaround, tri-state support and set/geo pruning can be implemented as extensions or special cases. This increases complexity and makes pruning for nested expressions even harder. ## Expr-aware pruning with stats propagation The idea of statistics propagation addresses the above limitations while staying simple: * **Unified mechanism** for multiple statistics types. * **Extensible**: supports new functions/expressions by defining APIs on expressions/UDFs, rather than wiring special cases into the optimizer. * **Naturally supports nested expressions**: the expression tree structure provides the composition. ## Goal Enable pruning for complex/nested expressions by propagating and combining `ColumnStats`. Constraints / scope: * Expressions reference columns from a **single table**. * Expressions may be arbitrarily nested. * Column statistics are optional; the pruning engine should use whatever is available to make best-effort attempts to prune the micro-partitions. Example: > We have table columns `a, b` and want to prune with the predicate `(pow(a, 2) + b) * 2 > 100`. Statistics container sketch: ```rust struct ColumnStats { // e.g. range [1, 10] range_stat: Option<RangeStats>, // e.g. null_count and row_count to infer all-null / all-non-null null_stat: Option<NullStats>, // e.g. values within set {"a", "b"} set_stat: Option<SetStats>, // e.g. geometries contained by a minimum bounding rectangle geo_stat: Option<GeoStats>, // ... } ``` ## Implementation ### Key data structures During pruning, each node in the physical expression tree produces either: * propagated statistics (`ColumnStats`), or * a pruning decision (tri-state) at predicate nodes. ```rust #[derive(Debug, Clone)] pub enum PruningIntermediate { IntermediateStats(ColumnStats), IntermediateResult(Vec<PruningResult>), } pub trait PhysicalExpr { // ... fn evaluate_pruning(&self, ctx: Arc<PruningContext>) -> Result<PruningIntermediate> { // ... } } ``` ### Algorithm #### 1) Arithmetic nodes (e.g. `+`, `-`, `abs()`, `upper()`, UDFs) Propagate each statistics type independently via optional APIs: ```rust fn propagate_range_stats( &self, _child_range_stats: &[RangeStats], ) -> Result<Option<RangeStats>> { Ok(None) } fn propagate_null_stats( &self, _child_null_stats: &[NullStats], ) -> Result<Option<NullStats>> { Ok(None) } fn propagate_set_stats( &self, _child_set_stats: &[SetStats], ) -> Result<Option<SetStats>> { Ok(None) } ``` `evaluate_pruning()` can have a default implementation that: 1. Calls `evaluate_pruning()` on children. 2. Collects available child stats by type. 3. Tries each `propagate_*` hook. 4. Returns `IntermediateStats(ColumnStats)` with whatever it can derive. #### 2) Predicate nodes (e.g. `>`, `<`, `=`, `IN`, spatial predicates) 1. Evaluate the predicate using available **non-null stats** (range/set/geo/etc.), producing a partial tri-state result. * If any stats type proves **SkipAll** or **KeepAll**, it may short-circuit (one `SkipAll` means the final combined partial result is also `SkipAll`) * Conflicts (both **SkipAll** and **KeepAll**) indicate a bug or incompatible semantics. 2. Combine with `NullStats` to produce the final result, and consider various edge cases. For example, when the partial combined result is `KeepAll`, it must first ensure both side has `NullStats` `AllNonNull`, to make the final result also `KeepAll`. ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
