comphead commented on PR #22026:
URL: https://github.com/apache/datafusion/pull/22026#issuecomment-4398834689

   I made a hybrid (genAI + manual) initial review
   
   **P1. Per-file validation on morselizer-level constants**

   opener.rs:546,550 — `validate_supported_virtual_columns` and `validate_predicate_does_not_reference_virtual_columns` run inside `prepare_open_file` (once per file). Both are pure functions of `self.table_schema.virtual_columns()` and `self.predicate`, which are fixed for the morselizer's lifetime. In a scan over 10K files this means 10K redundant tree-walks of the predicate and 10K `HashSet<&str>` rebuilds (opener.rs:1440-1441).

   Fix: hoist both checks into `ParquetMorselizerBuilder::build()` (or wherever the morselizer becomes immutable). The per-file cost drops to zero.

   **P2. Reallocating `null_replacements` and helper schemas per file**

   - opener.rs:1187-1193 — `null_replacements: HashMap<String, ScalarValue>` is built on every `build_stream` call. Virtual columns are static across files, so this map could live on the morselizer.
   - opener.rs:849,851,1244 — three `append_fields(...)` calls allocate fresh `Schema`s per file (logical-for-rewrite, physical-for-rewrite, `stream_schema`). These could be computed once, when the virtual columns are fixed.

   These are all cold-path (setup per file), so this is minor — but it comes gratis if you refactor P1.

   **P3. `virtual_columns: Vec<FieldRef>` cloned into `PreparedParquetOpen`**

   opener.rs:663 — `self.table_schema.virtual_columns().clone()` clones the `Vec` per file. `TableSchema` already stores `Arc<Vec<FieldRef>>` internally, but the getter returns `&Vec<FieldRef>`. Either return `&Arc<Vec<FieldRef>>` or store the `Arc` in `PreparedParquetOpen`. This is consistent with the pre-existing partition-cols pattern, so not strictly a regression.

   **P4. `schema_without_virtual_columns` rebuilds on every `try_pushdown_filters` call**

   table_schema.rs:275-280 — allocates a fresh `SchemaRef` on each call. It is a pure function of immutable fields; memoize it next to `table_schema` (the same pattern the struct already uses for `table_schema`). Low-frequency call (planning time), so low priority.
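   To make P1 (and the P3 follow-on) concrete, here is a minimal sketch of the hoisting idea. The types below (`Builder`, `Morselizer`, and the string-based column representation) are simplified stand-ins, not the actual DataFusion API — the point is only that validation and the `HashSet` build happen once at `build()` time, and the per-file path is reduced to an `Arc` clone:

   ```rust
   use std::collections::HashSet;
   use std::sync::Arc;

   // Hypothetical stand-ins for the real types in opener.rs.
   struct Builder {
       virtual_columns: Vec<String>,
       predicate_columns: Vec<String>, // columns referenced by the predicate
   }

   struct Morselizer {
       virtual_columns: Arc<Vec<String>>,
   }

   impl Builder {
       /// Validation runs once here, when the morselizer becomes immutable,
       /// instead of once per file inside prepare_open_file.
       fn build(self) -> Result<Morselizer, String> {
           let virtual_set: HashSet<&str> =
               self.virtual_columns.iter().map(String::as_str).collect();
           // P1: reject predicates that reference virtual columns, exactly once.
           for col in &self.predicate_columns {
               if virtual_set.contains(col.as_str()) {
                   return Err(format!("predicate references virtual column {col}"));
               }
           }
           Ok(Morselizer {
               virtual_columns: Arc::new(self.virtual_columns),
           })
       }
   }

   impl Morselizer {
       /// Per-file path: no validation, no HashSet rebuild, and (per P3)
       /// an Arc clone instead of cloning the whole Vec.
       fn prepare_open_file(&self) -> Arc<Vec<String>> {
           Arc::clone(&self.virtual_columns)
       }
   }

   fn main() {
       let m = Builder {
           virtual_columns: vec!["file_path".to_string()],
           predicate_columns: vec!["id".to_string()],
       }
       .build()
       .unwrap();
       assert_eq!(m.prepare_open_file().len(), 1);

       let bad = Builder {
           virtual_columns: vec!["file_path".to_string()],
           predicate_columns: vec!["file_path".to_string()],
       }
       .build();
       assert!(bad.is_err());
   }
   ```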
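   For P4, one way to express the memoization is `std::sync::OnceLock`: the derived schema is a pure function of immutable fields, so it can be computed lazily once and returned by reference afterward. Field and method names here are illustrative, not the actual `TableSchema` API, and the schema type is simplified to a `Vec` of column names:

   ```rust
   use std::sync::{Arc, OnceLock};

   // Simplified stand-in for arrow's SchemaRef.
   type SchemaRef = Arc<Vec<String>>;

   struct TableSchema {
       fields: Vec<String>,
       virtual_columns: Vec<String>,
       // Computed lazily on first use, then reused on every later call.
       schema_without_virtual: OnceLock<SchemaRef>,
   }

   impl TableSchema {
       fn new(fields: Vec<String>, virtual_columns: Vec<String>) -> Self {
           Self {
               fields,
               virtual_columns,
               schema_without_virtual: OnceLock::new(),
           }
       }

       /// The filter-and-allocate work runs at most once per TableSchema;
       /// repeated calls from try_pushdown_filters hit the cached Arc.
       fn schema_without_virtual_columns(&self) -> &SchemaRef {
           self.schema_without_virtual.get_or_init(|| {
               Arc::new(
                   self.fields
                       .iter()
                       .filter(|f| !self.virtual_columns.contains(*f))
                       .cloned()
                       .collect(),
               )
           })
       }
   }

   fn main() {
       let ts = TableSchema::new(
           vec!["id".to_string(), "file_path".to_string()],
           vec!["file_path".to_string()],
       );
       let a = ts.schema_without_virtual_columns();
       let b = ts.schema_without_virtual_columns();
       // Same Arc both times: no reallocation on repeated calls.
       assert!(Arc::ptr_eq(a, b));
       assert_eq!(**a, vec!["id".to_string()]);
   }
   ```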
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

