[I] Proposal: Prune complex predicates by propagating column statistics [datafusion]

via GitHub Wed, 24 Dec 2025 21:37:39 -0800


2010YOUY01 opened a new issue, #19487:
URL: https://github.com/apache/datafusion/issues/19487


   ### Is your feature request related to a problem or challenge?
   
   ### Problem
   
   Today, pruning is hard to extend beyond a small set of “built-in” expression 
patterns. As a result, realistic predicates (nested expressions, 
functions/UDFs, etc.) are either not prunable or require optimizer-specific 
special cases that are hard to maintain.
   
   ### Proposal
   
   Introduce **expression-aware pruning via statistics propagation**: evaluate 
pruning by propagating available `ColumnStats` through the **physical 
expression tree**, and letting each expression/function optionally define how 
it transforms each statistics type.
   
   Below it will demonstrate how to prune a micro-partition with a very complex 
and nested expression easily, using this statistics propagation approach.
   
   The idea is inspired from [Pruning in Snowflake: Working Smarter, Not 
Harder]([https://arxiv.org/abs/2504.11540](https://arxiv.org/abs/2504.11540))
   
   Section 3 outlines only the high-level idea and does not specify an 
algorithm. The concrete implementation is developed in this write-up and 
accompanying POC PR, and is looking for feedbacks.
   
   I think this approach can also lead to clean solutions to various pruning 
related requirements including
   https://github.com/apache/datafusion/issues/19028
   https://github.com/apache/datafusion/issues/18320
   https://github.com/apache/datafusion/issues/1692
   
   ### Terms
   
   * **Micro-partition**: the smallest unit of storage-level pruning (e.g., 
partitioned file/ row groups/ pages in parquet). They are prunable because the 
have partition level statistics the pruning engine can use to evaluate.
   * **Zone map / range stats**: per micro-partition `min/max` (and maybe more 
complex stats) for a column.
   * **Null stats**: summaries like `null_count`, `row_count`, or derived 
states like `AllNull`, `AllNotNull`, `Unknown`.
   * **Set stats**: a summary of distinct values (exact set or bounded 
dictionary) inside a micro-partition, e.g., `{'toy', 'clothing'}`.
   * **Geo stats**: spatial summaries such as an MBR (minimum bounding 
rectangle) containing all geometries in a micro-partition.
   * **Pruning result (tri-state)**:
   
     * **SkipAll**: predicate is always false on the micro-partition → safe to 
prune.
     * **KeepAll**: predicate is always true on the micro-partition → scan can 
skip evaluating the predicate.
     * **Unknown**: predicate is mixed → must scan and evaluate the predicate.
   * **Statistics propagation**: evaluating an expression by transforming and 
combining statistics bottom-up along the expression tree.
   
   
   ## Example walkthrough
   
   ### Background: micro-partition pruning using column statistics
   
   Suppose we have a filter predicate `x > 5`, i.e.:
   
   ```sql
   SELECT * FROM t WHERE x > 5
   ```
   
   Assume the table is stored with micro-partition zone maps:
   
   * partition 1: `min=0,  max=3`
   * partition 2: `min=2,  max=12`
   * partition 3: `min=8,  max=20`
   
   Then for `x > 5`:
   
   * `x > 5` on `[0,3]`  → **SkipAll** (always false)
   * `x > 5` on `[2,12]` → **Unknown** (mixed)
   * `x > 5` on `[8,20]` → **KeepAll** (always true)
   
   This demonstrates why tri-state matters: **KeepAll** lets us skip predicate 
evaluation during scan when it is guaranteed true.
   
   ### Issue: making it work on real workloads is hard
   
   Realistic filter predicates are nested and include functions/UDFs. Column 
statistics can also be diverse (min/max, null summaries, sets/dictionaries, 
geospatial summaries, etc.). Implementing a pruning framework that composes 
these cleanly is hard.
   
   Consider this workload:
   
   ```sql
   -- TODO: describe table schema
   
   SELECT *
   FROM t
   WHERE
       ((a * 7) > 100)
       OR (b IS NULL)
       OR (UPPER(c) IN ('ELECTRONIC', 'BOOK'))
       OR (ST_INTERSECTS(d, MBR_EUROPE))
   ```
   
   And assume we have a micro-partition with rich statistics:
   
   ```text
   a: min=0, max=10
   b: AllNotNull
   c: all values within set {'toy', 'clothing'}
   d: all geometries are within MBR_CHINA
   ```
   
   If we evaluate the predicate using all these statistics, this 
micro-partition can be safely skipped:
   
   * Sub-expr 1: `(a * 7) > 100`
   
     * `a` in `[0,10]` ⇒ `a * 7` in `[0,70]` ⇒ always false ⇒ **SkipAll**
   * Sub-expr 2: `b IS NULL`
   
     * `b` is `AllNotNull` ⇒ always false ⇒ **SkipAll**
   * Sub-expr 3: `UPPER(c) IN ('ELECTRONIC', 'BOOK')`
   
     * `c` in `{'toy','clothing'}` ⇒ `UPPER(c)` in `{'TOY','CLOTHING'}` ⇒ 
always false ⇒ **SkipAll**
   * Sub-expr 4: `ST_INTERSECTS(d, MBR_EUROPE)`
   
     * `d` within `MBR_CHINA` ⇒ if `MBR_CHINA ∩ MBR_EUROPE = ∅`, then always 
false ⇒ **SkipAll**
   
   For `OR`, if every branch is **SkipAll**, then the whole predicate is 
**SkipAll**, so the micro-partition can be pruned.
   
   Here are some figures to walk through the Stats/PruningResult propagation 
steps, the geospatial sub-expression is ignored.
   
   <img width="1174" height="806" alt="Image" 
src="https://github.com/user-attachments/assets/6505ba1e-c164-422d-bf93-c005c20d3b1c";
 />
   
   *Figure 1. Stats Propagation for sub-expression `(a*7)>100`*
   
   <img width="544" height="488" alt="Image" 
src="https://github.com/user-attachments/assets/c10329a4-6ebe-4664-999b-4afdaf261261";
 />
   
   *Figure 2. Stats Propagation for sub-expression `b IS NULL`*
   
   <img width="1270" height="535" alt="Image" 
src="https://github.com/user-attachments/assets/d5e89d64-7046-4f4c-8091-76bd37774357";
 />
   
   *Figure 3. Stats Propagation for sub-expression `UPPER(c) in ('ELECTRONIC', 
'BOOK')`*
   
   <img width="1585" height="845" alt="Image" 
src="https://github.com/user-attachments/assets/cbce6ae9-3248-47ab-803d-e4381ec5021b";
 />
   
   *Figure 4. Stats Propagation for final evaluation on `OR`s*
   
   ## Current Implementation State
   
   `PruningPredicate` is the main place to implement pruning. It supports basic 
range and null-summary arithmetic for a subset of expressions, but it 
effectively supports only two outcomes: `SkipAll` or `Other`.
   
   Code: 
[https://github.com/apache/datafusion/blob/6ce237492d9f75477c594ba132b2575932122dd6/datafusion/pruning/src/pruning_predicate.rs#L51](https://github.com/apache/datafusion/blob/6ce237492d9f75477c594ba132b2575932122dd6/datafusion/pruning/src/pruning_predicate.rs#L51)
   
   The idea is converting the raw predicate, to another predicate that involves 
partition min/max and other stats, and can be directly evaluated to a boolean 
value that decides whether to prune.
   
   Potential issues:
   
   * **No tri-state output.** It’s missing **KeepAll**, which can be used to 
skip redundant filtering.
   
     * Example: predicate `x > 0` with stats `min=10, max=20` should be 
**KeepAll** (scan, but don’t evaluate the predicate).
   * **Hard to extend for new functions/UDFs.** Implementing interval 
arithmetic or pruning semantics for not-yet-supported expressions often 
requires deep changes inside the optimizer.
   
   As a workaround, tri-state support and set/geo pruning can be implemented as 
extensions or special cases. This increases complexity and makes pruning for 
nested expressions even harder.
   
   
   ## Expr-aware pruning with stats propagation
   
   The idea of statistics propagation addresses the above limitations while 
staying simple:
   
   * **Unified mechanism** for multiple statistics types.
   * **Extensible**: supports new functions/expressions by defining APIs on 
expressions/UDFs, rather than wiring special cases into the optimizer.
   * **Naturally supports nested expressions**: the expression tree structure 
provides the composition.
   
   
   ## Goal
   
   Enable pruning for complex/nested expressions by propagating and combining 
`ColumnStats`.
   
   Constraints / scope:
   
   * Expressions reference columns from a **single table**.
   * Expressions may be arbitrarily nested.
   * Column statistics are optional; the pruning engine should use whatever is 
available to make best-effort attempts to prune the micro-partitions.
   
   Example:
   
   > We have table columns `a, b` and want to prune with the predicate `(pow(a, 
2) + b) * 2 > 100`.
   
   Statistics container sketch:
   
   ```rust
   struct ColumnStats {
       // e.g. range [1, 10]
       range_stat: Option<RangeStats>,
       // e.g. null_count and row_count to infer all-null / all-non-null
       null_stat: Option<NullStats>,
       // e.g. values within set {"a", "b"}
       set_stat: Option<SetStats>,
       // e.g. geometries contained by a minimum bounding rectangle
       geo_stat: Option<GeoStats>,
       // ...
   }
   ```
   
   
   ## Implementation
   
   ### Key data structures
   
   During pruning, each node in the physical expression tree produces either:
   
   * propagated statistics (`ColumnStats`), or
   * a pruning decision (tri-state) at predicate nodes.
   
   ```rust
   #[derive(Debug, Clone)]
   pub enum PruningIntermediate {
       IntermediateStats(ColumnStats),
       IntermediateResult(Vec<PruningResult>),
   }
   
   pub trait PhysicalExpr {
       // ...
       fn evaluate_pruning(&self, ctx: Arc<PruningContext>) -> 
Result<PruningIntermediate> {
           // ...
       }
   }
   ```
   
   ### Algorithm
   
   #### 1) Arithmetic nodes (e.g. `+`, `-`, `abs()`, `upper()`, UDFs)
   
   Propagate each statistics type independently via optional APIs:
   
   ```rust
   fn propagate_range_stats(
       &self,
       _child_range_stats: &[RangeStats],
   ) -> Result<Option<RangeStats>> {
       Ok(None)
   }
   
   fn propagate_null_stats(
       &self,
       _child_null_stats: &[NullStats],
   ) -> Result<Option<NullStats>> {
       Ok(None)
   }
   
   fn propagate_set_stats(
       &self,
       _child_set_stats: &[SetStats],
   ) -> Result<Option<SetStats>> {
       Ok(None)
   }
   ```
   
   `evaluate_pruning()` can have a default implementation that:
   
   1. Calls `evaluate_pruning()` on children.
   2. Collects available child stats by type.
   3. Tries each `propagate_*` hook.
   4. Returns `IntermediateStats(ColumnStats)` with whatever it can derive.
   
   #### 2) Predicate nodes (e.g. `>`, `<`, `=`, `IN`, spatial predicates)
   
   1. Evaluate the predicate using available **non-null stats** 
(range/set/geo/etc.), producing a partial tri-state result.
   
      * If any stats type proves **SkipAll** or **KeepAll**, it may 
short-circuit (one `SkipAll` means the final combined partial result is also 
`SkipAll`)
      * Conflicts (both **SkipAll** and **KeepAll**) indicate a bug or 
incompatible semantics.
   2. Combine with `NullStats` to produce the final result, and consider 
various edge cases. For example, when the partial combined result is `KeepAll`, 
it must first ensure both side has `NullStats` `AllNonNull`, to make the final 
result also `KeepAll`.
   
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Proposal: Prune complex predicates by propagating column statistics [datafusion]

Reply via email to