mbutrovich commented on PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4382959515

   Thanks again for the review @adriangb! Hopefully I addressed all of the 
feedback, but happy to keep chatting about it.
   
   **Mixed virtual/file predicates with `pushdown_filters=true`**
   
   Confirmed the silent-drop bug with failing tests. Root cause: 
`ParquetSource::try_pushdown_filters` called 
`can_expr_be_pushed_down_with_schemas` with the full table schema (now 
including virtual columns), so filters referencing `row_number` were marked 
`PushedDown::Yes` → `FilterExec` removed → the scan's `build_row_filter` 
couldn't resolve the virtual-col ref against `physical_file_schema` and 
silently dropped the conjunct.
   
   Arrow-rs can't accept virtual-column refs in a `RowFilter` at all: 
`ArrowPredicate::projection()` returns a `ProjectionMask` over parquet leaves 
only, and virtual columns are synthesized after filter evaluation. So virtual 
columns are projectable but never pushable.
   
   Fix: added `TableSchema::schema_without_virtual_columns()` (file + 
partition, excluding virtual) and `try_pushdown_filters` uses that. Virtual-col 
filters are now reported `PushedDown::No` and the `FilterExec` stays above the 
scan.
   
   Defense-in-depth in the opener for callers who bypass the optimizer (e.g. 
manual plan builders): `prepare_open_file` rejects `pushdown_filters=true` + 
virtual-col predicate with a clear error pointing at 
`with_pushdown_filters(false)` or keeping the filter above the scan.
   
   Tests: `source.rs::test_try_pushdown_filters_rejects_virtual_column_refs` 
(planner boundary), plus three opener-level tests covering mixed OR, 
virtual-only, and the allowed `pushdown_filters=false` case.
   
   **Ordering doc on `virtual_columns`**
   
   Struct field doc now spells out the `[file, partition, virtual]` layout, 
matching the builder methods.
   
   **Enum + `TryFrom`**
   
   Added `ParquetVirtualColumn` with `TryFrom<&FieldRef>` in a new 
`virtual_column.rs`. The runtime allowlist in the opener is replaced with 
`ParquetVirtualColumn::try_from(field)?`. Adding a new variant (e.g. 
`RowGroupIndex`) is now a compile-time obligation, and consumers can 
pattern-match instead of string-comparing extension-type names. Exposed as `pub 
use ParquetVirtualColumn` at the crate root.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to