Re: [I] Parallelize `list_files_for_scan` [datafusion]

via GitHub Mon, 26 Jan 2026 12:56:06 -0800


Tushar7012 commented on issue #19971:
URL: https://github.com/apache/datafusion/issues/19971#issuecomment-3801716177


   Thanks for the insight @Dandandan!
   
   I've implemented the parallelization in `list_files_for_scan` 
(`datafusion/catalog-listing/src/table.rs`) using `tokio::task::JoinSet`.
   
   Each `pruned_partition_list` call (which handles the listing and the 
CPU-intensive metadata/statistics processing you mentioned) is now spawned as 
a separate Tokio task. This lets us spread the deserialization and processing 
work across multiple threads, which should directly address the bottleneck 
observed in the profile and improve performance for tables with many 
partitions or files.
   
   I've also refactored `ListingTableUrl` 
(`datafusion/datasource/src/url.rs`) to pass `ConfigOptions` and 
`Arc<RuntimeEnv>` explicitly, ensuring `Send` compliance for the spawned tasks.
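
For readers following along, here is a rough sketch of the spawn-and-join shape described above. It is not the actual patch: to stay dependency-free it uses `std::thread::spawn` in place of `tokio::task::JoinSet`, and `ListingConfig`, `list_partition`, and `list_all` are hypothetical stand-ins rather than DataFusion APIs. The idea it illustrates is the same, though: each worker gets owned data plus a cloned `Arc`, so the closure is `Send + 'static`.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the shared configuration handed to every worker.
struct ListingConfig {
    file_extension: String,
}

/// Hypothetical stand-in for listing one table path. Real code would query
/// an object store and process file metadata/statistics here.
fn list_partition(path: &str, config: &ListingConfig) -> Vec<String> {
    (0..3)
        .map(|i| format!("{}/part-{}{}", path, i, config.file_extension))
        .collect()
}

/// Spawn one worker per table path, then join the workers in spawn order.
fn list_all(paths: Vec<String>, config: Arc<ListingConfig>) -> Vec<String> {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|path| {
            // The closure owns `path` and its own Arc clone, so it is
            // `Send + 'static`, which `thread::spawn` requires.
            let config = Arc::clone(&config);
            thread::spawn(move || list_partition(&path, &config))
        })
        .collect();

    let mut files = Vec::new();
    for handle in handles {
        // Joining in spawn order keeps the output order deterministic.
        files.extend(handle.join().expect("listing worker panicked"));
    }
    files
}
```

Calling `list_all` with two paths and a shared `Arc<ListingConfig>` returns the files of both paths in input order. With a `JoinSet`, results come back in completion order instead, so any ordering the caller relies on would need to be restored separately.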


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


