alamb opened a new issue, #11719:
URL: https://github.com/apache/datafusion/issues/11719

   ### Is your feature request related to a problem or challenge?
   
   
   I spent some time looking at the ClickBench results with DataFusion 40.0.0 
   https://github.com/apache/datafusion/issues/11567#issuecomment-2254520675 
(thanks @pmcgleenon 🙏 )
   
   Specifically, I looked into how we could make some of the already fast 
queries on the partitioned dataset faster. Unsurprisingly, for the really 
fast queries the query time is actually dominated by parquet metadata analysis 
and DataFusion statistics creation.
   
   For example
   
   ClickBench Q0
   ```sql
   SELECT COUNT(*) FROM hits;
   ```
   
   To reproduce, run:
   
   ```shell
   cd datafusion
   cargo run --release --bin dfbench -- clickbench --iterations 100 --path benchmarks/data/hits_partitioned --query 0
   ```
   
   I profiled this using Instruments. Here are some annotated screenshots
   
   <img width="1728" alt="Screenshot 2024-07-30 at 6 25 43 AM" 
src="https://github.com/user-attachments/assets/28592700-dc3f-407b-9287-621c32290a53">
   <img width="1728" alt="Screenshot 2024-07-30 at 6 26 53 AM" 
src="https://github.com/user-attachments/assets/3390a26d-f43f-4338-b92f-d681e3f2c378">
   
   
   Some of my takeaways are:
   1. A substantial amount of time is spent reading the parquet metadata twice
   2. A substantial amount of time is spent managing the ScalarValues in 
statistics
   
   
   ### Describe the solution you'd like
   
   It would be cool to make these queries faster by reducing the per-file 
metadata handling overhead (e.g. don't read the metadata more than once and 
figure out some way to make statistics handling more efficient).
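
To illustrate the "don't read the metadata more than once" idea, here is a minimal sketch of a per-file metadata cache. It uses only the Rust standard library; `FileMetadata`, `MetadataCache`, and `get_or_load` are hypothetical names made up for this example and are not DataFusion or parquet crate APIs. It assumes a path is enough to identify a file and ignores invalidation when a file changes.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a decoded parquet footer (not a real DataFusion type).
#[derive(Debug, PartialEq)]
pub struct FileMetadata {
    pub num_rows: u64,
}

/// Hypothetical cache: keeps decoded metadata per file path so that each
/// footer is decoded at most once, even if planning and execution both ask.
pub struct MetadataCache {
    entries: Mutex<HashMap<String, Arc<FileMetadata>>>,
}

impl MetadataCache {
    pub fn new() -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
        }
    }

    /// Returns the cached metadata for `path`, calling `load` only on a miss.
    /// The lock is held while loading, which keeps the sketch simple but
    /// serializes concurrent loads.
    pub fn get_or_load<F>(&self, path: &str, load: F) -> Arc<FileMetadata>
    where
        F: FnOnce() -> FileMetadata,
    {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(path) {
            return Arc::clone(hit);
        }
        let loaded = Arc::new(load());
        entries.insert(path.to_string(), Arc::clone(&loaded));
        loaded
    }
}
```

With something like this, the code path that builds statistics and the code path that opens the file for scanning could share one `Arc<FileMetadata>` instead of each decoding the footer. Whether that fits DataFusion's actual reader structure is an open question for this issue.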
   
   ### Describe alternatives you've considered
   
   Note this project isn't broken down into tasks yet
   
   I think @Ted-Jiang did some work a while back to cache parquet metadata.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

