alamb opened a new issue, #21554:
URL: https://github.com/apache/datafusion/issues/21554

   ### Is your feature request related to a problem or challenge?
   
   In the parquet opener, DataFusion currently does per-file schema adaptation 
and pruning setup, including predicate rewrites and pruning predicate 
construction:
   - https://github.com/apache/datafusion/blob/590a5178c8ffb17873f612a9c1da234fc1a18ff3/datafusion/datasource-parquet/src/opener.rs#L743-L788
   - https://github.com/apache/datafusion/blob/590a5178c8ffb17873f612a9c1da234fc1a18ff3/datafusion/datasource-parquet/src/opener.rs#L1523-L1547
   
   As @adriangb noted on 
https://github.com/apache/datafusion/pull/21480#issuecomment-4215673477, many 
deployments only have a small number of physical schemas, often just one, so 
repeating the same work across many files is wasteful.
   
   PR #21480 from @fpetkovski improved this area by avoiding page pruning 
predicate construction unless page indexes are enabled, but there still seems 
to be a follow-on opportunity to cache equivalent pruning setup across files 
with the same physical schema.
   
   ### Describe the solution you'd like
   
   Cache parquet pruning setup across files when the physical schema and other 
correctness-relevant inputs are the same.
   
   This likely includes:
   - expression/schema rewrite results
   - pruning predicate construction
   - page pruning predicate construction where applicable
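
A minimal sketch of the caching idea, using only the Rust standard library. Every name here (`SetupKey`, `PruningSetup`, `PruningSetupCache`, `build_setup`) is hypothetical and is not an existing DataFusion API. The physical schema and predicate are modeled as plain strings only to keep the example self-contained. A real implementation would key on the actual Arrow schema and physical expression, plus any other inputs that affect correctness.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Hypothetical cache key: every input that can change the setup result.
/// Assumption: schema and predicate are stand-ins for the real types.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct SetupKey {
    /// (column name, type name) pairs of the file's physical schema.
    physical_schema: Vec<(String, String)>,
    /// Textual form of the pushed-down predicate.
    predicate: String,
    /// Whether page indexes are enabled for this scan.
    page_index_enabled: bool,
}

/// Hypothetical result of the per-file setup work.
#[derive(Debug)]
struct PruningSetup {
    rewritten_predicate: String,
    page_pruning_predicate: Option<String>,
}

/// Cache shared by all file openers of one scan.
#[derive(Default)]
struct PruningSetupCache {
    entries: Mutex<HashMap<SetupKey, Arc<PruningSetup>>>,
    /// Counts how often the expensive path actually ran.
    builds: AtomicUsize,
}

impl PruningSetupCache {
    /// Returns the cached setup for `key`, building it on first use.
    /// The lock is held while building, so each key is built at most once.
    fn get_or_build(&self, key: &SetupKey) -> Arc<PruningSetup> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(key) {
            return Arc::clone(hit);
        }
        let setup = Arc::new(build_setup(key));
        self.builds.fetch_add(1, Ordering::Relaxed);
        entries.insert(key.clone(), Arc::clone(&setup));
        setup
    }

    fn builds(&self) -> usize {
        self.builds.load(Ordering::Relaxed)
    }
}

/// Placeholder for the expensive work: expression/schema rewrite, pruning
/// predicate construction, and page pruning predicate construction.
fn build_setup(key: &SetupKey) -> PruningSetup {
    let rewritten_predicate = format!(
        "{} [adapted to {} physical columns]",
        key.predicate,
        key.physical_schema.len()
    );
    // Mirrors the idea in PR #21480: only build when page indexes are enabled.
    let page_pruning_predicate = key
        .page_index_enabled
        .then(|| format!("page({})", key.predicate));
    PruningSetup {
        rewritten_predicate,
        page_pruning_predicate,
    }
}
```

With this shape, files that share a physical schema and predicate get the same `Arc<PruningSetup>`, while a change to any key field (for example `page_index_enabled`) triggers a fresh build. Holding the lock during the build is the simplest choice but serializes concurrent openers; whether that trade-off is acceptable would need measuring.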
   
   ### Describe alternatives you've considered
   
   Continue with smaller local optimizations like PR #21480, or add more 
one-off fast paths. Those help, but caching shared setup seems like the more 
direct way to avoid repeated work.
   
   ### Additional context
   
   Relevant links:
   - Tracking comment from @adriangb:
     https://github.com/apache/datafusion/pull/21480#issuecomment-4215673477
   - Original PR from @fpetkovski:
     https://github.com/apache/datafusion/pull/21480
   - Page index loading / page pruning setup:
     https://github.com/apache/datafusion/blob/590a5178c8ffb17873f612a9c1da234fc1a18ff3/datafusion/datasource-parquet/src/opener.rs#L793-L839
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

