mbutrovich commented on issue #20135:
URL: https://github.com/apache/datafusion/issues/20135#issuecomment-4381652523

I accidentally put my comment on #20132 instead of here. Too many tabs open, copying:

> I'd like to open a PR that lands a subset of the work [@jkylling](https://github.com/jkylling) prototyped in [#20133](https://github.com/apache/datafusion/pull/20133), as physical-layer plumbing below this epic.
>
> **What I'd build.** Extend `TableSchema` with a `virtual_columns: Vec<FieldRef>` list. Thread it into `ParquetOpener` via `ArrowReaderOptions::with_virtual_columns(...)`, handling the three arrow-rs edges (predicate-pushdown schema, projection mask, output stream schema) the same way [#20133](https://github.com/apache/datafusion/pull/20133) did. No changes to `ListingTable`, SQL planning, `information_schema`, or any `_metadata` surface.
>
> **Approach.** The core logic would follow [@jkylling](https://github.com/jkylling)'s [#20133](https://github.com/apache/datafusion/pull/20133) with credit in the commit message and PR description, narrower in scope and focused on what downstream physical-plan consumers need today.
>
> **Why this is distinct from [#20071](https://github.com/apache/datafusion/pull/20071).** The UDF-rewrite pattern in [#20071](https://github.com/apache/datafusion/pull/20071) works for file-constant metadata (`input_file_name`, modification time) because the opener substitutes a literal. It cannot express reader-generated virtual columns like row number or row group index, whose values change per row and are produced by the parquet reader itself via `ArrowReaderOptions::with_virtual_columns`. The two approaches are complementary, not overlapping.
>
> **Why it does not preempt this epic.** Any UX you settle on (relation-scoped hidden columns, `_metadata` struct, scalar UDFs) will need some way for the opener to produce these columns. `TableSchema::with_virtual_columns` is that seam. It makes no commitment to visibility, collision, or naming policy.
>
> **Who is ready to consume it today.**
>
> * Apache DataFusion Comet ([[native_datafusion] Add support for reading row index metadata columns datafusion-comet#3432](https://github.com/apache/datafusion-comet/issues/3432)): the native scan currently falls back to Spark whenever `_tmp_metadata_row_index` is requested. Comet builds physical plans directly and does not need the SQL UX.
> * Vortex (per [@AdamGS](https://github.com/AdamGS) above).
>
> [#20071](https://github.com/apache/datafusion/pull/20071) landed the same shape of change (minimal opener-boundary plumbing, deferring UX to this epic) with approval from [@alamb](https://github.com/alamb).
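
For readers skimming the thread, the distinction drawn in the quoted comment can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyTableSchema`, `Field`, `file_projection`, `file_constant_column`, `row_number_column`, and the column name `row_number` are hypothetical stand-ins written against the Rust standard library. They are not the DataFusion `TableSchema` or the arrow-rs reader API. The layout of the output schema (file columns first, virtual columns after) is likewise an assumption made for the example.

```rust
// Toy model: every type and function here is a hypothetical stand-in,
// not the DataFusion `TableSchema` or the arrow-rs reader API.

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: &'static str,
}

impl Field {
    fn new(name: &str, data_type: &'static str) -> Self {
        Field { name: name.to_string(), data_type }
    }
}

/// Schema that keeps reader-generated columns apart from columns stored in the file.
#[derive(Debug, Clone)]
struct ToyTableSchema {
    file_columns: Vec<Field>,
    virtual_columns: Vec<Field>,
}

impl ToyTableSchema {
    fn new(file_columns: Vec<Field>) -> Self {
        ToyTableSchema { file_columns, virtual_columns: Vec::new() }
    }

    /// Builder-style setter, loosely mirroring the proposed `with_virtual_columns` seam.
    fn with_virtual_columns(mut self, virtual_columns: Vec<Field>) -> Self {
        self.virtual_columns = virtual_columns;
        self
    }

    /// Output stream schema (assumed layout): file columns first, then virtual columns.
    fn output_schema(&self) -> Vec<Field> {
        self.file_columns
            .iter()
            .chain(self.virtual_columns.iter())
            .cloned()
            .collect()
    }

    /// Indices of the requested columns that physically exist in the file.
    /// Virtual columns are skipped: the file has nothing to project for them.
    fn file_projection(&self, requested: &[&str]) -> Vec<usize> {
        requested
            .iter()
            .filter_map(|name| self.file_columns.iter().position(|f| f.name == *name))
            .collect()
    }
}

/// File-constant metadata: one value per file, so a repeated literal suffices.
fn file_constant_column(value: &str, num_rows: usize) -> Vec<String> {
    vec![value.to_string(); num_rows]
}

/// Reader-generated metadata: the value differs per row, so it has to come from
/// the component that knows where each batch starts within the file.
fn row_number_column(first_row_in_file: u64, num_rows: usize) -> Vec<u64> {
    (first_row_in_file..first_row_in_file + num_rows as u64).collect()
}
```

In this model, a schema built with `ToyTableSchema::new(...).with_virtual_columns(vec![Field::new("row_number", "Int64")])` reports the extra column in `output_schema()` but never in `file_projection(...)`. `file_constant_column` needs no knowledge of batch position, while `row_number_column` does, which is the reason the comment gives for a literal-substituting rewrite not being able to cover the per-row case.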
