mbutrovich commented on issue #20135:
URL: https://github.com/apache/datafusion/issues/20135#issuecomment-4381652523

   I accidentally posted my comment on #20132 instead of here (too many tabs 
open), so I'm copying it over:
   
   > I'd like to open a PR that lands a subset of the work 
[@jkylling](https://github.com/jkylling) prototyped in 
[#20133](https://github.com/apache/datafusion/pull/20133), as physical-layer 
plumbing below this epic.
   > 
   > **What I'd build.** Extend `TableSchema` with a `virtual_columns: 
Vec<FieldRef>` list. Thread it into `ParquetOpener` via 
`ArrowReaderOptions::with_virtual_columns(...)`, handling the three arrow-rs 
edges (predicate-pushdown schema, projection mask, output stream schema) the 
same way [#20133](https://github.com/apache/datafusion/pull/20133) did. No 
changes to `ListingTable`, SQL planning, `information_schema`, or any 
`_metadata` surface.
   > 
   > **Approach.** The core logic would follow 
[@jkylling](https://github.com/jkylling)'s 
[#20133](https://github.com/apache/datafusion/pull/20133), with credit in the 
commit message and PR description, but would be narrower in scope and focused 
on what downstream physical-plan consumers need today.
   > 
   > **Why this is distinct from 
[#20071](https://github.com/apache/datafusion/pull/20071).** The UDF-rewrite 
pattern in [#20071](https://github.com/apache/datafusion/pull/20071) works for 
file-constant metadata (`input_file_name`, modification time) because the 
opener substitutes a literal. It cannot express reader-generated virtual 
columns like row number or row group index, whose values change per row and are 
produced by the parquet reader itself via 
`ArrowReaderOptions::with_virtual_columns`. The two approaches are 
complementary, not overlapping.
   > 
   > **Why it does not preempt this epic.** Any UX you settle on 
(relation-scoped hidden columns, `_metadata` struct, scalar UDFs) will need 
some way for the opener to produce these columns. 
`TableSchema::with_virtual_columns` is that seam. It makes no commitment to 
visibility, collision, or naming policy.
   > 
   > **Who is ready to consume it today.**
   > 
   > * Apache DataFusion Comet ([[native_datafusion] Add support for reading 
row index metadata columns · 
datafusion-comet#3432](https://github.com/apache/datafusion-comet/issues/3432)):
 the native scan currently falls back to Spark whenever 
`_tmp_metadata_row_index` is requested. Comet builds physical plans directly 
and does not need the SQL UX.
   > * Vortex (per [@AdamGS](https://github.com/AdamGS) above).
   > 
   > [#20071](https://github.com/apache/datafusion/pull/20071) landed the same 
shape of change (minimal opener-boundary plumbing, deferring UX to this epic) 
with approval from [@alamb](https://github.com/alamb).
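
A minimal, std-only Rust sketch may help make the plumbing described in the quoted comment concrete. It is not DataFusion or arrow-rs code: `ToyTableSchema`, `ColumnSource`, `resolve`, `file_projection_mask`, `row_numbers`, and `file_constant` are hypothetical names invented for illustration, and the layout (virtual columns appended after the file columns) is an assumption. It shows how a projection over the combined schema could be split into a file projection mask plus reader-generated columns, and why a per-row value such as a row number cannot be replaced by one literal the way a file-constant value can.

```rust
// Hypothetical model only: these are NOT the real DataFusion or arrow-rs types.

/// Where a column of the output schema gets its values from.
#[derive(Debug, Clone, PartialEq)]
enum ColumnSource {
    /// Stored in the file; holds the index into the file schema.
    File(usize),
    /// Generated by the reader; holds the index into the virtual column list.
    Virtual(usize),
}

/// Toy stand-in for a table schema that carries a list of virtual columns.
#[derive(Debug, Clone)]
struct ToyTableSchema {
    file_columns: Vec<String>,
    virtual_columns: Vec<String>,
}

impl ToyTableSchema {
    fn new(file_columns: &[&str]) -> Self {
        Self {
            file_columns: file_columns.iter().map(|c| c.to_string()).collect(),
            virtual_columns: Vec::new(),
        }
    }

    /// Builder-style setter, loosely mirroring the shape of the proposed seam.
    fn with_virtual_columns(mut self, columns: &[&str]) -> Self {
        self.virtual_columns = columns.iter().map(|c| c.to_string()).collect();
        self
    }

    /// Output schema (assumed layout): file columns, then virtual columns.
    fn output_columns(&self) -> Vec<String> {
        self.file_columns
            .iter()
            .chain(self.virtual_columns.iter())
            .cloned()
            .collect()
    }

    /// Map each projected output index to its source, or None if out of range.
    fn resolve(&self, projection: &[usize]) -> Option<Vec<ColumnSource>> {
        let n_file = self.file_columns.len();
        projection
            .iter()
            .map(|&i| {
                if i < n_file {
                    Some(ColumnSource::File(i))
                } else if i - n_file < self.virtual_columns.len() {
                    Some(ColumnSource::Virtual(i - n_file))
                } else {
                    None
                }
            })
            .collect()
    }

    /// Only file-backed indices belong in the file projection mask.
    fn file_projection_mask(&self, projection: &[usize]) -> Vec<usize> {
        let n_file = self.file_columns.len();
        projection.iter().copied().filter(|&i| i < n_file).collect()
    }
}

/// Reader-generated values: one distinct value per row, continuing from
/// `first_row`, so no single literal can stand in for the column.
fn row_numbers(first_row: u64, batch_len: usize) -> Vec<u64> {
    (first_row..first_row + batch_len as u64).collect()
}

/// File-constant values: the same literal repeated for every row of the file.
fn file_constant(value: &str, batch_len: usize) -> Vec<String> {
    vec![value.to_string(); batch_len]
}
```

In this toy model, projecting `[2, 0]` over file columns `id, name` plus a virtual `row_number` resolves to one virtual and one file column, and only index `0` reaches the file projection mask. The real opener would additionally have to keep the predicate-pushdown schema and the output stream schema consistent with that split.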
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

