mbutrovich opened a new pull request, #22026:
URL: https://github.com/apache/datafusion/pull/22026

   ## Which issue does this PR close?
   
   - Part of #20135 (epic: virtual / metadata columns). Does not close that 
epic; see [this 
comment](https://github.com/apache/datafusion/issues/20135#issuecomment-4381652523)
 describing the scope split.
   - Revives #20133 (auto-closed as stale) — same core plumbing, credit to 
@jkylling.
   - Unblocks apache/datafusion-comet#3432 (remove native_datafusion fallback 
for Spark's `_tmp_metadata_row_index`).
   
   ## Rationale for this change
   
   arrow-rs 57.1.0+ supports Parquet virtual columns (`row_number`, 
`row_group_index`) via `ArrowReaderOptions::with_virtual_columns`, and 
DataFusion pins a new-enough arrow-rs for the API to be available. DataFusion 
does not yet plumb the option through `ParquetOpener`, so consumers (notably 
Comet) cannot project Spark's `_tmp_metadata_row_index` through the 
native_datafusion scan path.
   
   This PR adds the minimal opener-boundary plumbing so `TableSchema` can carry 
virtual columns and the Parquet reader produces them. UX / SQL-layer surface 
for virtual columns stays deferred to the epic in #20135 — this follows the 
same framing alamb blessed for #20071 (the `input_file_name()` UDF).
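
   The intended consumer-side shape can be sketched with simplified stand-ins. This is a minimal model, not the real DataFusion types: only the builder-method names and the `[file, partition, virtual]` layout come from this PR; the `Field` struct and everything else here is hypothetical illustration.

   ```rust
   // Minimal model of the builder semantics: the resulting layout is
   // [file, partition, virtual] no matter which order the builder
   // methods were called in. Simplified stand-in types, not DataFusion's.
   #[derive(Debug, Clone, PartialEq)]
   struct Field {
       name: String,
   }

   #[derive(Debug, Default)]
   struct TableSchema {
       file_cols: Vec<Field>,
       partition_cols: Vec<Field>,
       virtual_cols: Vec<Field>,
   }

   impl TableSchema {
       fn new(file_cols: Vec<Field>) -> Self {
           Self { file_cols, ..Default::default() }
       }
       fn with_table_partition_cols(mut self, cols: Vec<Field>) -> Self {
           self.partition_cols = cols;
           self
       }
       fn with_virtual_columns(mut self, cols: Vec<Field>) -> Self {
           self.virtual_cols = cols;
           self
       }
       // Stable layout: file columns, then partition columns, then virtual.
       fn table_fields(&self) -> Vec<Field> {
           self.file_cols
               .iter()
               .chain(&self.partition_cols)
               .chain(&self.virtual_cols)
               .cloned()
               .collect()
       }
   }

   fn main() {
       let f = |n: &str| Field { name: n.to_string() };
       let a = TableSchema::new(vec![f("c1")])
           .with_table_partition_cols(vec![f("part")])
           .with_virtual_columns(vec![f("row_number")]);
       let b = TableSchema::new(vec![f("c1")])
           .with_virtual_columns(vec![f("row_number")])
           .with_table_partition_cols(vec![f("part")]);
       assert_eq!(a.table_fields(), b.table_fields());
   }
   ```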
   
   ## What changes are included in this PR?
   
   - `TableSchema::with_virtual_columns(...)` builder + `virtual_columns()` 
getter. Layout: `[file, partition, virtual]`. Composable with 
`with_table_partition_cols` in either order.
   - `ParquetOpener` forwards the fields to 
`ArrowReaderOptions::with_virtual_columns`; augments the schemas passed to the 
expr-adapter / simplifier with the virtual fields so references to virtual 
columns rewrite to themselves; strips them from the projection fed to 
`ProjectionMask::roots` (which only understands file columns); and appends them 
to `stream_schema` so `reassign_expr_columns` resolves them by name.
   - An allowlist (`SUPPORTED_VIRTUAL_EXTENSION_TYPES`) gates which arrow-rs 
virtual extension types are accepted. Currently only `RowNumber`. 
`RowGroupIndex` will land in a follow-up with its own test, so a future 
arrow-rs release can't silently add untested behavior.
   - `arrow-schema` added as a direct dep (previously transitive via `arrow`) 
so the allowlist references `RowNumber::NAME` from arrow-rs instead of 
hardcoding the string.
   - Explicitly **not** in scope (follow-ups): `ListingTable` / SQL-layer 
surface, a three-arg constructor on `TableSchema`, 
`ParquetSource::with_virtual_columns`, and `RowGroupIndex` support.
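
   The projection split described in the `ParquetOpener` bullet can be sketched in isolation. This is a simplified model under assumptions: `split_projection` is a hypothetical helper, and the real logic in `opener.rs` operates on Arrow schemas rather than bare column indices.

   ```rust
   // Given a projection over the combined [file, partition, virtual]
   // layout, keep only file-column indices for ProjectionMask::roots
   // (partition and virtual columns are not physical Parquet columns),
   // and return the virtual-column indices separately so those fields
   // can be appended to the stream schema and resolved by name.
   fn split_projection(
       projection: &[usize],
       num_file_cols: usize,
       num_partition_cols: usize,
   ) -> (Vec<usize>, Vec<usize>) {
       let virtual_start = num_file_cols + num_partition_cols;
       let file_proj: Vec<usize> = projection
           .iter()
           .copied()
           .filter(|&i| i < num_file_cols)
           .collect();
       let virtual_proj: Vec<usize> = projection
           .iter()
           .copied()
           .filter(|&i| i >= virtual_start)
           .collect();
       (file_proj, virtual_proj)
   }

   fn main() {
       // Layout: [c1, c2, part, row_number] -> 2 file cols, 1 partition col.
       let (file_proj, virtual_proj) = split_projection(&[0, 2, 3], 2, 1);
       assert_eq!(file_proj, vec![0]);
       assert_eq!(virtual_proj, vec![3]);
   }
   ```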
   
   ## Are these changes tested?
   
   Yes. New unit tests in `opener.rs`:
   
   - `test_row_index_basic` — single row group, select data + row_number.
   - `test_row_index_projection_only` — select only row_number.
   - `test_row_index_multi_row_group` — 3 × 100 rows, verify absolute 0..300 
across boundaries.
   - `test_row_index_with_row_group_skip` — predicate stats-prunes the middle 
row group; verify row numbers stay absolute (0..100 ++ 200..300). Critical 
correctness gate for Spark (and for apache/arrow-rs#8863).
   - `test_row_index_with_partition_cols` — partition + virtual + data columns 
compose correctly.
   - `test_row_index_nullable_int64` — nullability flag flows through unchanged 
(matches Spark's `_tmp_metadata_row_index` declaration).
   - `test_unsupported_virtual_extension_type_rejected` — using `RowGroupIndex` 
(a real arrow-rs type deliberately not in the allowlist) errors with 
`NotImplemented` instead of silently forwarding.
   
   Plus a `TableSchema` unit test verifying the `[file, partition, virtual]` 
layout is stable regardless of builder-call order.
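
   The invariant behind `test_row_index_with_row_group_skip` can be sketched independently of the reader. This is a simplified model; `expected_row_numbers` is a hypothetical helper, not arrow-rs code.

   ```rust
   // Row numbers are absolute file offsets: a stats-pruned row group
   // still advances the offset, so surviving row groups keep their
   // original numbering rather than being renumbered contiguously.
   fn expected_row_numbers(row_group_sizes: &[i64], kept: &[bool]) -> Vec<i64> {
       let mut out = Vec::new();
       let mut offset = 0i64;
       for (size, keep) in row_group_sizes.iter().zip(kept) {
           if *keep {
               out.extend(offset..offset + *size);
           }
           offset += *size; // advance past skipped groups too
       }
       out
   }

   fn main() {
       // 3 x 100 rows, middle row group pruned: expect 0..100 then 200..300.
       let rn = expected_row_numbers(&[100, 100, 100], &[true, false, true]);
       assert_eq!(rn.len(), 200);
       assert_eq!((rn[0], rn[99], rn[100], rn[199]), (0, 99, 200, 299));
   }
   ```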
   
   ## Are there any user-facing changes?
   
   Public API additions on `TableSchema`: `with_virtual_columns(...)` and 
`virtual_columns()`. No existing API changed; no breaking changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

