mbutrovich commented on PR #22026: URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4382959515
Thanks again for the review @adriangb! Hopefully I addressed all of the feedback, but happy to keep chatting about it. **Mixed virtual/file predicates with `pushdown_filters=true`** Confirmed the silent-drop bug with failing tests. Root cause: `ParquetSource::try_pushdown_filters` called `can_expr_be_pushed_down_with_schemas` with the full table schema (now including virtual columns), so filters referencing `row_number` were marked `PushedDown::Yes` → `FilterExec` removed → the scan's `build_row_filter` couldn't resolve the virtual-col ref against `physical_file_schema` and silently dropped the conjunct. Arrow-rs can't accept virtual-column refs in a `RowFilter` at all: `ArrowPredicate::projection()` returns a `ProjectionMask` over parquet leaves only, and virtual columns are synthesized after filter evaluation. So virtual columns are projectable but never pushable. Fix: added `TableSchema::schema_without_virtual_columns()` (file + partition, excluding virtual) and `try_pushdown_filters` uses that. Virtual-col filters are now reported `PushedDown::No` and the `FilterExec` stays above the scan. Defense-in-depth in the opener for callers who bypass the optimizer (e.g. manual plan builders): `prepare_open_file` rejects `pushdown_filters=true` + virtual-col predicate with a clear error pointing at `with_pushdown_filters(false)` or keeping the filter above the scan. Tests: `source.rs::test_try_pushdown_filters_rejects_virtual_column_refs` (planner boundary), plus three opener-level tests covering mixed OR, virtual-only, and the allowed `pushdown_filters=false` case. **Ordering doc on `virtual_columns`** Struct field doc now spells out the `[file, partition, virtual]` layout, matching the builder methods. **Enum + `TryFrom`** Added `ParquetVirtualColumn` with `TryFrom<&FieldRef>` in a new `virtual_column.rs`. The runtime allowlist in the opener is replaced with `ParquetVirtualColumn::try_from(field)?`. Adding a new variant (e.g. `RowGroupIndex`) is now a compile-time obligation, and consumers can pattern-match instead of string-comparing extension-type names. Exposed as `pub use ParquetVirtualColumn` at the crate root. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
