xudong963 opened a new pull request, #21637:
URL: https://github.com/apache/datafusion/pull/21637

   ## Which issue does this PR close?
   
   - Closes #19028.
   
   ## Rationale for this change
   
   When DataFusion evaluates a Parquet scan with filter pushdown, it uses row 
group statistics to determine which row groups contain matching rows. The 
`RowGroupAccessPlanFilter` already tracks which row groups are "fully matched" 
— where statistics prove that **all** rows satisfy the predicate (via 
`is_fully_matched`).
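
   To illustrate what "fully matched" means for a predicate such as `x < 200`, here is a minimal standalone sketch. It is not DataFusion's implementation: `ColumnStats`, `RowGroupMatch`, and `classify_less_than` are hypothetical names, and the statistics are assumed to be exact integer min/max values plus a null count.

```rust
/// Simplified min/max statistics for one column in one row group.
/// Hypothetical type for illustration; not a DataFusion or parquet struct.
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
    pub null_count: u64,
}

/// What the statistics can prove about a row group.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowGroupMatch {
    /// No row can satisfy the predicate; the row group can be skipped.
    Pruned,
    /// Every row satisfies the predicate; per-row filtering is redundant.
    FullyMatched,
    /// Statistics are inconclusive; rows must be filtered individually.
    Partial,
}

/// Classify a row group for the predicate `x < bound`.
pub fn classify_less_than(stats: ColumnStats, bound: i64) -> RowGroupMatch {
    if stats.min >= bound {
        // Even the smallest value fails the predicate.
        RowGroupMatch::Pruned
    } else if stats.max < bound && stats.null_count == 0 {
        // Even the largest value passes, and there are no NULLs
        // (a NULL never satisfies `x < bound`).
        RowGroupMatch::FullyMatched
    } else {
        RowGroupMatch::Partial
    }
}
```

   With `bound = 200`, a row group whose values span 0 to 199 with no nulls is classified as `FullyMatched`, while one spanning 0 to 500 is `Partial` and still needs row-level filtering.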
   
   However, this information was not propagated downstream. Even for fully 
matched row groups:
   1. **Page index pruning** still evaluated page-level statistics (wasted work 
since no pages can be pruned)
   2. **RowFilter evaluation** still decoded filter columns and evaluated the 
predicate for every row (wasted work since every row passes)
   
   This is especially costly when filter columns are expensive to decode (e.g., 
large strings) or when predicates are complex. Common real-world examples 
include time-range filters where entire row groups fall within the range, or 
`WHERE status != 'DELETED'` on data with no deleted rows.
   
   ## What changes are included in this PR?
   
   ### DataFusion changes (this PR)
   
   1. **`row_group_filter.rs`**: `RowGroupAccessPlanFilter::build()` now 
returns `(ParquetAccessPlan, Vec<usize>)` — the access plan plus the indices of 
fully matched row groups.
   
   2. **`page_filter.rs`**: `prune_plan_with_page_index()` accepts a 
`fully_matched_row_groups` parameter and skips page-level pruning for those row 
groups.
   
   3. **`opener.rs`**: Wires fully matched row groups through the pipeline — 
passes them to page pruning and to the `ParquetPushDecoderBuilder` via 
`with_fully_matched_row_groups()`.
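
   The data flow described in the three items above can be sketched with simplified stand-in types. This is an assumption-laden illustration, not the code in this PR: `Verdict`, `Access`, `build_plan`, and `prune_pages` are hypothetical, and the real `ParquetAccessPlan` and `prune_plan_with_page_index()` have different signatures.

```rust
use std::collections::HashSet;

/// Outcome of row-group statistics pruning (hypothetical, simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pruned,
    FullyMatched,
    Partial,
}

/// Per-row-group entry in a simplified access plan (hypothetical).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Access {
    Skip,
    Scan,
    /// Scan only the listed pages, as decided by page-level pruning.
    Pages(Vec<usize>),
}

/// Stand-in for `build()`: returns the plan plus the indices of
/// fully matched row groups.
pub fn build_plan(verdicts: &[Verdict]) -> (Vec<Access>, Vec<usize>) {
    let mut plan = Vec::with_capacity(verdicts.len());
    let mut fully_matched = Vec::new();
    for (idx, verdict) in verdicts.iter().enumerate() {
        match verdict {
            Verdict::Pruned => plan.push(Access::Skip),
            Verdict::Partial => plan.push(Access::Scan),
            Verdict::FullyMatched => {
                plan.push(Access::Scan);
                fully_matched.push(idx);
            }
        }
    }
    (plan, fully_matched)
}

/// Stand-in for page-index pruning: row groups listed in `fully_matched`
/// are left untouched. Returns how many row groups had page statistics
/// evaluated.
pub fn prune_pages<F>(plan: &mut [Access], fully_matched: &[usize], surviving_pages: F) -> usize
where
    F: Fn(usize) -> Vec<usize>,
{
    let skip: HashSet<usize> = fully_matched.iter().copied().collect();
    let mut evaluated = 0;
    for (idx, access) in plan.iter_mut().enumerate() {
        // Already-skipped row groups and fully matched ones need no page pruning.
        if !matches!(*access, Access::Scan) || skip.contains(&idx) {
            continue;
        }
        evaluated += 1;
        *access = Access::Pages(surviving_pages(idx));
    }
    evaluated
}
```

   In this sketch, the same `fully_matched` vector would then be handed on to the decoder, mirroring how `opener.rs` passes it to both page pruning and the decoder builder.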
   
   ### Arrow-rs dependency (apache/arrow-rs#9694)
   
   The new `ArrowReaderBuilder::with_fully_matched_row_groups()` API in 
arrow-rs allows skipping `RowFilter` evaluation during Parquet decoding for 
specified row groups. This PR uses `[patch.crates-io]` pointing to the arrow-rs 
fork branch until that PR is merged and released.
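
   As a rough model of what skipping `RowFilter` evaluation saves, the sketch below scans in-memory "row groups" and bypasses the predicate for the ones marked fully matched. It does not use the arrow-rs API; `RowGroup` and `scan` are hypothetical, and a real decoder would also avoid decoding the filter column rather than just avoiding the comparison.

```rust
/// One in-memory "row group": a filter column and a payload column.
/// Hypothetical type for illustration only.
pub struct RowGroup {
    pub filter_col: Vec<i64>,
    pub payload: Vec<String>,
}

/// Scan all row groups, returning the selected payloads and the number of
/// predicate evaluations performed. Row groups whose index appears in
/// `fully_matched` are emitted whole without touching `filter_col`.
pub fn scan<P>(groups: &[RowGroup], fully_matched: &[usize], predicate: P) -> (Vec<String>, usize)
where
    P: Fn(i64) -> bool,
{
    let mut out = Vec::new();
    let mut evaluations = 0;
    for (idx, group) in groups.iter().enumerate() {
        if fully_matched.contains(&idx) {
            // Statistics already proved every row passes: skip the filter.
            out.extend(group.payload.iter().cloned());
            continue;
        }
        for (value, payload) in group.filter_col.iter().zip(&group.payload) {
            evaluations += 1;
            if predicate(*value) {
                out.push(payload.clone());
            }
        }
    }
    (out, evaluations)
}
```

   The result set is identical with or without the skip; only the number of predicate evaluations (and, in a real reader, filter-column decodes) changes.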
   
   ### Benchmark
   
   This PR includes a Criterion benchmark (`parquet_fully_matched_filter`) that 
drives `ParquetPushDecoder` directly, the same code path DataFusion's async 
opener uses. Dataset: 20 row groups × 50K rows, with a 1KB string payload 
column and predicate `x < 200` (all row groups fully matched).
   
   | Scenario | Time | vs. baseline |
   |----------|------|-------------|
   | Filter pushdown, no skip | ~43 ms | baseline |
   | **Filter pushdown, with skip** | **~20 ms** | **~2.2x faster** |
   | No pushdown at all | ~24 ms | — |
   
   ## Are these changes tested?
   
   - The 82 existing `datafusion-datasource-parquet` tests that do not need the 
`parquet-testing` submodule all pass; the 16 failing tests are pre-existing 
failures caused by that submodule being missing
   - The benchmark verifies correctness by asserting the expected row count
   - Clippy and fmt pass
   
   ## Are there any user-facing changes?
   
   No user-facing API changes. This is a transparent performance optimization — 
queries that previously worked will now be faster when row group statistics 
prove all rows match the predicate.
   
   **Note:** This PR depends on apache/arrow-rs#9694. The `[patch.crates-io]` 
in `Cargo.toml` will be removed once that arrow-rs change is released.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

