Tushar7012 commented on issue #19971: URL: https://github.com/apache/datafusion/issues/19971#issuecomment-3801716177
Thanks for the insight, @Dandandan! I've parallelized `list_files_for_scan` (`datafusion/catalog-listing/src/table.rs`) using `tokio::task::JoinSet`. Each `pruned_partition_list` call (`datafusion/catalog-listing/src/helpers.rs`), which handles the listing and the CPU-intensive metadata/statistics processing you mentioned, is now spawned as a separate Tokio task. This spreads the deserialization and processing work across multiple threads, which should directly address the bottleneck observed in the profile and significantly improve performance for tables with many partitions or files.

I've also refactored `ListingTableUrl` (`datafusion/datasource/src/url.rs`) to pass `ConfigOptions` and `Arc<RuntimeEnv>` explicitly, so that the spawned tasks satisfy the `Send` bound.
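For readers who want to see the shape of the fan-out described above, here is a minimal sketch. It is not the code from the change: to stay dependency-free it uses `std::thread::spawn` as a stand-in for `tokio::task::JoinSet`, and `ListingConfig`, `list_partition`, and `list_all` are hypothetical names invented for illustration. The point it shares with the real change is that a spawned closure must be `Send + 'static`, which is why shared state is handed to each worker explicitly as an `Arc` clone rather than borrowed from a surrounding context.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the shared, read-only state each listing task needs
/// (the comment above passes `ConfigOptions` and `Arc<RuntimeEnv>` for this).
struct ListingConfig {
    file_extension: String,
}

/// Hypothetical stand-in for one `pruned_partition_list` call: keep only the
/// files under `table_path` whose names end with the configured extension.
fn list_partition(table_path: &str, files: &[&str], config: &ListingConfig) -> Vec<String> {
    files
        .iter()
        .filter(|f| f.ends_with(&config.file_extension))
        .map(|f| format!("{table_path}/{f}"))
        .collect()
}

/// Spawn one worker per table path, then join them in spawn order and
/// flatten the per-path results into a single list.
fn list_all(paths: Vec<(String, Vec<&'static str>)>, config: Arc<ListingConfig>) -> Vec<String> {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|(path, files)| {
            // Each worker owns its inputs and its own Arc clone, so the
            // closure is `Send + 'static` as `thread::spawn` requires.
            let config = Arc::clone(&config);
            thread::spawn(move || list_partition(&path, &files, &config))
        })
        .collect();

    let mut all = Vec::new();
    for handle in handles {
        all.extend(handle.join().expect("listing worker panicked"));
    }
    all
}

fn main() {
    let config = Arc::new(ListingConfig {
        file_extension: ".parquet".to_string(),
    });
    let paths = vec![
        ("bucket/a".to_string(), vec!["1.parquet", "notes.txt"]),
        ("bucket/b".to_string(), vec!["2.parquet"]),
    ];
    println!("{:?}", list_all(paths, config));
}
```

One difference worth keeping in mind: this sketch joins workers in the order they were spawned, so the output order is deterministic, whereas `JoinSet::join_next` yields tasks as they complete.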
