mbutrovich commented on PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4400142871
Addressed @comphead's [review](https://github.com/apache/datafusion/pull/22026#issuecomment-4398834689):

- **P1 + P2 + P3:** Introduced `VirtualColumnsState` (`Arc`-shared, holds validated fields, `null_replacements`, and the logical-with-virtual schema). Built once per scan partition in `ParquetSource::create_morselizer`; stored as `Option<Arc<VirtualColumnsState>>` on `ParquetMorselizer` and `PreparedParquetOpen`.
- **Per-file cost** for virtual-column scans drops to one `Arc::clone`. The two remaining per-file `append_fields` calls (`physical_for_rewrite`, `stream_schema`) depend on per-file coercions/projection mask and can't be cached.
- **P4 skipped:** adding `OnceLock<SchemaRef>` to every `TableSchema` to save a one-shot `Vec` iteration on a planning-time path is not a necessary compute-vs-memory trade.
- **[opener.rs:547](https://github.com/apache/datafusion/pull/22026#discussion_r3202953589):** Call site moved into `create_morselizer` with an inline comment explaining why predicate validation gates on `pushdown_filters` (when pushdown is off, the predicate stays above the scan as a `FilterExec` and resolves virtual columns there).
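For readers who have not opened the diff, the sharing pattern described above can be sketched roughly as follows. This is a simplified illustration using only the standard library, not the code in the PR: the field types, the `build_state` and `prepare_open` helpers, and the `PreparedOpen` struct are hypothetical stand-ins, and the `"NULL"` replacement value is a placeholder.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `VirtualColumnsState`. The field types
/// are simplified assumptions, not DataFusion's actual definitions.
#[allow(dead_code)]
#[derive(Debug)]
struct VirtualColumnsState {
    /// Names of the validated virtual fields.
    fields: Vec<String>,
    /// Virtual column name -> placeholder replacement value.
    null_replacements: HashMap<String, String>,
    /// Logical schema (column names only) with the virtual columns appended.
    logical_with_virtual: Vec<String>,
}

/// Built once per scan partition (the role `create_morselizer` plays in the PR).
/// Returns `None` when the scan has no virtual columns.
#[allow(dead_code)]
fn build_state(logical: &[&str], virtual_cols: &[&str]) -> Option<Arc<VirtualColumnsState>> {
    if virtual_cols.is_empty() {
        return None;
    }
    let fields: Vec<String> = virtual_cols.iter().map(|c| c.to_string()).collect();
    let null_replacements: HashMap<String, String> = fields
        .iter()
        .map(|c| (c.clone(), "NULL".to_string()))
        .collect();
    let logical_with_virtual: Vec<String> = logical
        .iter()
        .map(|c| c.to_string())
        .chain(fields.iter().cloned())
        .collect();
    Some(Arc::new(VirtualColumnsState {
        fields,
        null_replacements,
        logical_with_virtual,
    }))
}

/// Hypothetical per-file open state: it holds a shared handle, not a copy.
#[allow(dead_code)]
struct PreparedOpen {
    file: String,
    virtual_columns: Option<Arc<VirtualColumnsState>>,
}

/// Per-file work for the shared state is a single `Arc::clone`
/// (a reference-count increment), however many fields the state holds.
#[allow(dead_code)]
fn prepare_open(file: &str, shared: &Option<Arc<VirtualColumnsState>>) -> PreparedOpen {
    PreparedOpen {
        file: file.to_string(),
        virtual_columns: shared.clone(),
    }
}
```

Calling `build_state` once and then `prepare_open` for each file leaves every per-file value pointing at the same allocation, which `Arc::ptr_eq` can confirm. Anything that depends on per-file coercions or the projection mask would still have to be computed per file, as noted above.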
