mbutrovich opened a new pull request, #22026: URL: https://github.com/apache/datafusion/pull/22026
## Which issue does this PR close?

- Part of #20135 (epic: virtual / metadata columns). Does not close that epic; see [this comment](https://github.com/apache/datafusion/issues/20135#issuecomment-4381652523) describing the scope split.
- Revives #20133 (auto-closed as stale) — same core plumbing; credit to @jkylling.
- Unblocks apache/datafusion-comet#3432 (removing the `native_datafusion` fallback for Spark's `_tmp_metadata_row_index`).

## Rationale for this change

arrow-rs 57.1.0+ supports Parquet virtual columns (`row_number`, `row_group_index`) via `ArrowReaderOptions::with_virtual_columns`, and DataFusion already pins a new-enough arrow-rs for that API to be available. However, DataFusion does not yet plumb the option through `ParquetOpener`, so consumers (notably Comet) cannot project Spark's `_tmp_metadata_row_index` through the `native_datafusion` scan path.

This PR adds the minimal opener-boundary plumbing so that `TableSchema` can carry virtual columns and the Parquet reader produces them. The UX / SQL-layer surface for virtual columns remains deferred to the epic in #20135 — this follows the same framing alamb blessed for #20071 (the `input_file_name()` UDF).

## What changes are included in this PR?

- A `TableSchema::with_virtual_columns(...)` builder plus a `virtual_columns()` getter. The field layout is `[file, partition, virtual]`, and the builder composes with `with_table_partition_cols` in either order (see the first sketch after this list).
- `ParquetOpener` forwards the virtual fields to `ArrowReaderOptions::with_virtual_columns` (see the second sketch after this list). It augments the schemas passed to the expr-adapter / simplifier with the virtual fields so that virtual-column references identity-rewrite, strips them from the projection fed to `ProjectionMask::roots` (which only understands file columns), and appends them to `stream_schema` so `reassign_expr_columns` resolves them by name.
- An allowlist (`SUPPORTED_VIRTUAL_EXTENSION_TYPES`) gates which arrow-rs virtual extension types are accepted. Currently only `RowNumber`; `RowGroupIndex` will land in a follow-up with its own test, so a future arrow-rs release can't silently add untested behavior.
- `arrow-schema` is added as a direct dependency (previously transitive via `arrow`) so the allowlist can reference `RowNumber::NAME` from arrow-rs instead of hardcoding the string.
- Explicitly **not** in scope (follow-ups): the `ListingTable` / SQL-layer surface, a three-arg constructor on `TableSchema`, `ParquetSource::with_virtual_columns`, and `RowGroupIndex` support.
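For reviewers skimming the API shape, here is a minimal sketch of the intended builder flow — not the PR's exact code. Assumptions are marked in the comments: the `TableSchema` crate path, its current two-arg constructor, and both builders accepting `Vec<FieldRef>`; `RowNumber` is assumed to be exported from `arrow_schema::extension`, while `RowNumber::NAME` and `EXTENSION_TYPE_NAME_KEY` are the arrow-schema items the allowlist relies on.

```rust
use std::collections::HashMap;
use std::sync::Arc;

use arrow_schema::extension::{ExtensionType, RowNumber, EXTENSION_TYPE_NAME_KEY};
use arrow_schema::{DataType, Field, Schema};
// Crate path assumed; use whichever module the datasource layer exports
// `TableSchema` from.
use datafusion_datasource::TableSchema;

fn build_table_schema() -> TableSchema {
    let file_schema = Arc::new(Schema::new(vec![Field::new(
        "value",
        DataType::Utf8,
        true,
    )]));

    // Spark declares `_tmp_metadata_row_index` as a nullable Int64; the
    // nullability flag is forwarded unchanged (see the test list below).
    // Tagging the field with the `RowNumber` extension name is what marks
    // it as a virtual column.
    let row_index = Arc::new(
        Field::new("_tmp_metadata_row_index", DataType::Int64, true).with_metadata(
            HashMap::from([(
                EXTENSION_TYPE_NAME_KEY.to_string(),
                RowNumber::NAME.to_string(),
            )]),
        ),
    );

    // Constructor shape assumed (the current two-arg form; the three-arg
    // variant is explicitly deferred). The two builder calls compose in
    // either order; the resulting layout is always [file, partition, virtual].
    TableSchema::new(file_schema, vec![])
        .with_table_partition_cols(vec![Arc::new(Field::new(
            "date",
            DataType::Utf8,
            false,
        ))])
        .with_virtual_columns(vec![row_index])
}
```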
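And a second sketch of the opener-side forwarding, again illustrative rather than the PR's exact code. It assumes `ArrowReaderOptions::with_virtual_columns` takes the virtual fields as `Vec<FieldRef>` and returns `Self`; a sync reader is used purely for brevity, whereas the real opener drives the async stream builder with the same options type.

```rust
use std::fs::File;

use arrow_schema::FieldRef;
use parquet::arrow::arrow_reader::{ArrowReaderOptions, ParquetRecordBatchReaderBuilder};
use parquet::errors::Result;

fn scan_with_virtual_columns(file: File, virtual_columns: Vec<FieldRef>) -> Result<()> {
    // Assumed call shape: virtual fields in, builder-style options out.
    let options = ArrowReaderOptions::new().with_virtual_columns(virtual_columns);
    let builder = ParquetRecordBatchReaderBuilder::try_new_with_options(file, options)?;

    // The reader's output schema now carries the virtual fields after the
    // file columns. `ProjectionMask::roots` only understands file columns,
    // so the opener strips virtual fields from the projection it feeds the
    // mask and resolves them by name in `stream_schema` afterwards.
    for batch in builder.build()? {
        let _batch = batch?;
    }
    Ok(())
}
```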
## Are these changes tested?

Yes. New unit tests in `opener.rs`:

- `test_row_index_basic` — single row group; selects data + `row_number`.
- `test_row_index_projection_only` — selects only `row_number`.
- `test_row_index_multi_row_group` — 3 × 100 rows; verifies absolute row numbers 0..300 across row-group boundaries.
- `test_row_index_with_row_group_skip` — a predicate stats-prunes the middle row group; verifies row numbers stay absolute (0..100 ++ 200..300). This is the critical correctness gate for Spark (and for apache/arrow-rs#8863).
- `test_row_index_with_partition_cols` — partition + virtual + data columns compose correctly.
- `test_row_index_nullable_int64` — the nullability flag flows through unchanged (matching Spark's `_tmp_metadata_row_index` declaration).
- `test_unsupported_virtual_extension_type_rejected` — using `RowGroupIndex` (a real arrow-rs type deliberately not in the allowlist) errors with `NotImplemented` instead of silently forwarding.

Plus a `TableSchema` unit test verifying the `[file, partition, virtual]` layout is stable regardless of builder-call order.

## Are there any user-facing changes?

Public API additions on `TableSchema`: `with_virtual_columns(...)` and `virtual_columns()`. No existing API changed; no breaking changes.
