comphead commented on PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4398834689
I made hybrid(genAI + manual) initial review
```
P1. Per-file validation on morselizer-level constants
opener.rs:546,550 — validate_supported_virtual_columns and
validate_predicate_does_not_reference_virtual_columns run inside
prepare_open_file (once per file). Both are pure functions of
self.table_schema.virtual_columns() and self.predicate, which are fixed
for the morselizer's lifetime. In a scan over 10K files this is 10K redundant
tree-walks of the predicate and 10K HashSet<&str>
rebuilds (opener.rs:1440-1441).
Fix: hoist both checks into ParquetMorselizerBuilder::build() (or wherever
the morselizer becomes immutable). Per-file cost drops to zero.
P2. Reallocating null_replacements and helper schemas per file
- opener.rs:1187-1193 — null_replacements: HashMap<String, ScalarValue> is
built per build_stream call. Virtual columns are static across files, so this
map could live on the morselizer.
- opener.rs:849,851,1244 — three append_fields(...) calls that allocate
fresh Schemas per file (logical-for-rewrite, physical-for-rewrite,
stream_schema). Could be computed once when virtual cols are
fixed.
These are all cold-path (setup-per-file), so it's minor — but gratis if
you refactor P1.
P3. virtual_columns: Vec<FieldRef> cloned into PreparedParquetOpen
opener.rs:663 — self.table_schema.virtual_columns().clone() clones the Vec
per file. TableSchema already stores Arc<Vec<FieldRef>> internally, but the
getter returns &Vec<FieldRef>. Either return
&Arc<Vec<FieldRef>> or store the Arc in PreparedParquetOpen. Consistent
with the pre-existing partition-cols pattern, so not strictly a regression.
P4. schema_without_virtual_columns rebuilds on every try_pushdown_filters
call
table_schema.rs:275-280 — allocates a fresh SchemaRef each call. It's a
pure function of immutable fields; memoize it next to table_schema (same
pattern the struct already uses for table_schema).
Low-frequency call (planning-time), so low priority.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]