xudong963 opened a new pull request, #21637: URL: https://github.com/apache/datafusion/pull/21637
## Which issue does this PR close? - Closes #19028. ## Rationale for this change When DataFusion evaluates a Parquet scan with filter pushdown, it uses row group statistics to determine which row groups contain matching rows. The `RowGroupAccessPlanFilter` already tracks which row groups are "fully matched" — where statistics prove that **all** rows satisfy the predicate (via `is_fully_matched`). However, this information was not propagated downstream. Even for fully matched row groups: 1. **Page index pruning** still evaluated page-level statistics (wasted work since no pages can be pruned) 2. **RowFilter evaluation** still decoded filter columns and evaluated the predicate for every row (wasted work since every row passes) This is especially costly when filter columns are expensive to decode (e.g., large strings) or when predicates are complex. Common real-world examples include time-range filters where entire row groups fall within the range, or `WHERE status != 'DELETED'` on data with no deleted rows. ## What changes are included in this PR? ### DataFusion changes (this PR) 1. **`row_group_filter.rs`**: `RowGroupAccessPlanFilter::build()` now returns `(ParquetAccessPlan, Vec<usize>)` — the access plan plus the indices of fully matched row groups. 2. **`page_filter.rs`**: `prune_plan_with_page_index()` accepts a `fully_matched_row_groups` parameter and skips page-level pruning for those row groups. 3. **`opener.rs`**: Wires fully matched row groups through the pipeline — passes them to page pruning and to the `ParquetPushDecoderBuilder` via `with_fully_matched_row_groups()`. ### Arrow-rs dependency (apache/arrow-rs#9694) The new `ArrowReaderBuilder::with_fully_matched_row_groups()` API in arrow-rs allows skipping `RowFilter` evaluation during Parquet decoding for specified row groups. This PR uses `[patch.crates-io]` pointing to the arrow-rs fork branch until that PR is merged and released. ### Benchmark Includes a criterion benchmark (`parquet_fully_matched_filter`) using `ParquetPushDecoder` directly — the same code path DataFusion's async opener uses. Dataset: 20 row groups × 50K rows, with a 1KB string payload column and predicate `x < 200` (all row groups fully matched). | Scenario | Time | vs. baseline | |----------|------|-------------| | Filter pushdown, no skip | ~43 ms | baseline | | **Filter pushdown, with skip** | **~20 ms** | **~2.2x faster** | | No pushdown at all | ~24 ms | — | ## Are these changes tested? - All 82 existing non-submodule `datafusion-datasource-parquet` tests pass (16 failures are pre-existing, caused by missing `parquet-testing` submodule) - The benchmark verifies correctness by asserting the expected row count - Clippy and fmt pass ## Are there any user-facing changes? No user-facing API changes. This is a transparent performance optimization — queries that previously worked will now be faster when row group statistics prove all rows match the predicate. **Note:** This PR depends on apache/arrow-rs#9694. The `[patch.crates-io]` in `Cargo.toml` will be removed once that arrow-rs change is released. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
