alamb commented on issue #11567: URL: https://github.com/apache/datafusion/issues/11567#issuecomment-2258049992
I looked at some short queries and found one potential improvement: https://github.com/apache/datafusion/issues/11719

I also looked at Q38:

```sql
SELECT "URL", COUNT(*) AS PageViews
FROM hits
WHERE "CounterID" = 62
  AND "EventDate"::INT::DATE >= '2013-07-01'
  AND "EventDate"::INT::DATE <= '2013-07-31'
  AND "IsRefresh" = 0
  AND "IsLink" <> 0
  AND "IsDownload" = 0
GROUP BY "URL"
ORDER BY PageViews DESC
LIMIT 10 OFFSET 1000;
```

```shell
$ cargo run --release --bin dfbench -- clickbench --iterations 100 --path benchmarks/data/hits_partitioned --query 38
```

More than 50% of the time is spent doing snappy decoding (which we aren't likely to be able to improve)

<img width="1728" alt="Screenshot 2024-07-30 at 6 40 44 AM" src="https://github.com/user-attachments/assets/a1e53db1-6e67-4014-b4f5-77308a581c76">

12% of the time is reading string data from parquet (maybe stringview will help)

10% of the time is spent decoding parquet metadata

<img width="1728" alt="Screenshot 2024-07-30 at 6 44 17 AM" src="https://github.com/user-attachments/assets/c95aac55-0cac-4be7-a981-a3e3ce8c79ac">