adriangb opened a new pull request, #22237: URL: https://github.com/apache/datafusion/pull/22237
## Which issue does this PR close?

- Part of #22144 (Adaptive filter pushdown), split into a reviewable stack. This is **PR 4 of 4**: the integration.

## Rationale for this change

With the wrapper type (#22234), per-conjunct pruning stats (#22235), and the cost model (#22236) in place, this PR wires them into the parquet scan so filter placement adapts to measured selectivity and throughput instead of a fixed pushdown decision.

## What changes are included in this PR?

- `ParquetMorselizer` carries tagged predicate conjuncts and a shared `SelectivityTracker`; at file open the tracker partitions conjuncts into row-level vs post-scan buckets, seeded by the per-conjunct row-group / page-index pruning rates collected for free during pruning.
- `AdaptiveParquetStream` drives the push decoder one row group at a time, re-partitioning at row-group boundaries and swapping the decoder strategy (row filter + projection mask) when placement changes.
- Integrates with the fully-matched run splitting from #21637: fully-matched runs get a no-filter decoder; needs-filter runs get the adaptive setup.
- `HashJoinExec` wraps its pushed-down dynamic filter in `OptionalFilterPhysicalExpr` so the tracker may drop it when it is not cost-effective; join correctness is unaffected.
- Adds config knobs: `filter_pushdown_min_bytes_per_sec`, `filter_collecting_byte_ratio_threshold`, `filter_confidence_z`.

## Are these changes tested?

Yes: parquet filter-pushdown integration tests, physical-optimizer filter-pushdown tests, proto round-trip, and sqllogictest coverage.

## Are there any user-facing changes?

New parquet read config knobs (documented in `configs.md`). Behavior change to the parquet scan's filter placement.

**Note:** this PR pins a custom `arrow-rs` branch for the push-decoder `StrategySwap` APIs; landing upstream requires those APIs in a released `arrow-rs` first.
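For reviewers who want a mental model of the row-level vs post-scan partitioning described under "What changes are included in this PR?", here is a minimal, self-contained sketch. It is not the code in this PR: `ConjunctStats`, `Placement`, `pass_rate_upper_bound`, `place`, and `partition` are hypothetical names, and the decision rule (a pessimistic normal-approximation bound on the pass rate, compared against a bytes-per-second threshold) is only an assumption about how knobs like `filter_confidence_z` and `filter_pushdown_min_bytes_per_sec` might be used. The actual cost model lives in #22236.

```rust
/// Hypothetical per-conjunct measurements (illustrative, not the PR's types).
#[derive(Debug, Clone)]
struct ConjunctStats {
    /// Rows this conjunct has been evaluated on so far.
    rows_seen: u64,
    /// Rows that passed the conjunct.
    rows_passed: u64,
    /// Bytes of non-filter columns that would be decoded without the filter.
    other_column_bytes: u64,
    /// Wall-clock seconds spent evaluating the conjunct.
    eval_secs: f64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Placement {
    RowLevel,
    PostScan,
}

/// Upper confidence bound on the pass rate (normal approximation).
/// With no observations we assume the filter keeps every row.
fn pass_rate_upper_bound(stats: &ConjunctStats, z: f64) -> f64 {
    if stats.rows_seen == 0 {
        return 1.0;
    }
    let n = stats.rows_seen as f64;
    let p = stats.rows_passed as f64 / n;
    (p + z * (p * (1.0 - p) / n).sqrt()).min(1.0)
}

/// Assumed rule: push a conjunct to row level only if, even under the
/// pessimistic pass rate, it skips at least `min_bytes_per_sec` bytes of
/// decoding per second of evaluation time.
fn place(stats: &ConjunctStats, min_bytes_per_sec: f64, z: f64) -> Placement {
    if stats.eval_secs <= 0.0 {
        // No timing measurement yet: stay conservative.
        return Placement::PostScan;
    }
    let pruned_fraction = 1.0 - pass_rate_upper_bound(stats, z);
    let bytes_skipped = pruned_fraction * stats.other_column_bytes as f64;
    if bytes_skipped / stats.eval_secs >= min_bytes_per_sec {
        Placement::RowLevel
    } else {
        Placement::PostScan
    }
}

/// Split conjunct indices into (row-level, post-scan) buckets.
fn partition(
    all: &[ConjunctStats],
    min_bytes_per_sec: f64,
    z: f64,
) -> (Vec<usize>, Vec<usize>) {
    let mut row_level = Vec::new();
    let mut post_scan = Vec::new();
    for (idx, stats) in all.iter().enumerate() {
        match place(stats, min_bytes_per_sec, z) {
            Placement::RowLevel => row_level.push(idx),
            Placement::PostScan => post_scan.push(idx),
        }
    }
    (row_level, post_scan)
}
```

Under these assumptions, a highly selective conjunct (1% of rows pass) lands in the row-level bucket, while a conjunct that keeps 99% of rows, or one with no measurements yet, stays post-scan. Re-running `partition` with updated stats at each row-group boundary is the analogue of the re-partitioning step described above.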
---

**Stacked PR: diff is cumulative against `main`.** Review the top commit *"feat: adaptive filter pushdown for the parquet scan"*; the commits below it are PRs #22234, #22235, #22236.

Stack (review/merge in order):

1. #22234: OptionalFilterPhysicalExpr + proto
2. #22235: Per-conjunct pruning statistics
3. #22236: SelectivityTracker cost model
4. **this PR**: Adaptive parquet scan integration
