Tushar7012 opened a new pull request, #20023:
URL: https://github.com/apache/datafusion/pull/20023

   ## Which issue does this PR close?
   
   - Part of improving DataFusion's file listing performance for large-scale 
table scans.
   
   ## Rationale for this change
   
   When a `ListingTable` has multiple table paths, the current implementation 
drives their listings with `future::try_join_all`. That combinator polls all of 
the listing futures from a single task, so the listings are interleaved on that 
task rather than spread across the runtime's worker threads. This can be a 
bottleneck when listing files across many directories.
   
   By parallelizing the file listing with `tokio::task::JoinSet`, we spawn a 
separate task for each table path, so the listings can run in parallel on 
different worker threads. This should improve listing performance for tables 
with multiple paths.
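
   As a rough illustration of the fan-out described above, here is a minimal 
sketch that uses only the standard library. It is not the code in this PR: 
`std::thread::spawn` stands in for `JoinSet::spawn`, and `Config`, `Runtime`, 
`list_path`, and `list_all` are hypothetical names standing in for 
`ConfigOptions`, `RuntimeEnv`, and the listing helpers. Like a spawned tokio 
task, a spawned thread needs `'static` inputs, which is why each worker gets 
its own clone of the config and of the `Arc`.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for `ConfigOptions`: cheap to clone per task.
#[derive(Clone)]
struct Config {
    suffix: String,
}

// Hypothetical stand-in for `RuntimeEnv`: maps a table path to its files.
struct Runtime {
    files: Vec<(String, Vec<String>)>,
}

// Hypothetical listing helper: files under `path` that match the suffix.
fn list_path(path: &str, config: &Config, runtime: &Runtime) -> Result<Vec<String>, String> {
    let (_, files) = runtime
        .files
        .iter()
        .find(|(p, _)| p.as_str() == path)
        .ok_or_else(|| format!("unknown path: {}", path))?;
    Ok(files
        .iter()
        .filter(|f| f.ends_with(&config.suffix))
        .cloned()
        .collect())
}

// Spawn one worker per table path, then gather results in path order.
fn list_all(
    paths: &[String],
    config: &Config,
    runtime: &Arc<Runtime>,
) -> Result<Vec<String>, String> {
    let handles: Vec<_> = paths
        .iter()
        .cloned()
        .map(|path| {
            // Each worker owns its inputs, because `spawn` requires `'static` data.
            let config = config.clone();
            let runtime = Arc::clone(runtime);
            thread::spawn(move || list_path(&path, &config, &runtime))
        })
        .collect();

    let mut all = Vec::new();
    for handle in handles {
        // The first `?` surfaces a panicked worker, the second a listing error.
        let files = handle
            .join()
            .map_err(|_| "listing task panicked".to_string())??;
        all.extend(files);
    }
    Ok(all)
}
```

   All workers are started before any result is collected, so a slow path does 
not delay the start of the others; the first listing error is returned to the 
caller.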
   
   ## What changes are included in this PR?
   
   1. **Parallel file listing with `JoinSet`** - Modified `list_files_for_scan` 
to spawn parallel tasks for each table path using `tokio::task::JoinSet`
   
   2. **Function signature refactoring** - Updated `pruned_partition_list` and 
`list_all_files` to accept `&ConfigOptions` and `&Arc<RuntimeEnv>` instead of 
`&dyn Session` to enable cloning for parallel task spawning
   
   3. **WASM compatibility** - Added conditional compilation 
(`#[cfg(not(target_arch = "wasm32"))]`) to use parallel execution for native 
targets and sequential execution with `try_join_all` for WASM targets, since 
WASM has limited multi-threading support
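
   The target-specific split in item 3 can be pictured with the following 
standalone sketch. It is an assumption-laden illustration rather than the PR's 
code: `list_one` and `list_all` are hypothetical, and `std::thread::scope` 
stands in for `tokio::task::JoinSet` so that the example needs no external 
crates. Only one of the two `list_all` definitions is compiled for a given 
target.

```rust
// Hypothetical per-path listing step; a real one would query an object store.
fn list_one(path: &str) -> Vec<String> {
    vec![format!("{}/part-0.parquet", path)]
}

// Native targets: fan out one worker per path.
// `std::thread::scope` stands in for `tokio::task::JoinSet` here.
#[cfg(not(target_arch = "wasm32"))]
fn list_all(paths: &[&str]) -> Vec<String> {
    std::thread::scope(|scope| {
        // Start every worker before joining any of them.
        let handles: Vec<_> = paths
            .iter()
            .map(|path| scope.spawn(move || list_one(path)))
            .collect();
        // Join in spawn order so the output order matches the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("listing worker panicked"))
            .collect()
    })
}

// wasm32: assume no worker threads, so list the paths one after another.
#[cfg(target_arch = "wasm32")]
fn list_all(paths: &[&str]) -> Vec<String> {
    paths.iter().flat_map(|path| list_one(path)).collect()
}
```

   Because both variants share one signature, call sites do not need their own 
`cfg` attributes.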
   
   ### Files changed:
   - `datafusion/catalog-listing/src/table.rs` - Main parallelization logic
   - `datafusion/catalog-listing/src/helpers.rs` - Updated function signatures
   - `datafusion/catalog-listing/src/options.rs` - Updated function signatures
   - `datafusion/datasource/src/url.rs` - Updated `list_all_files` signature
   - `datafusion/core/src/datasource/listing/table.rs` - Updated call sites
   - `datafusion/core/tests/catalog_listing/pruned_partition_list.rs` - Updated 
test calls
   
   ## Are these changes tested?
   
   Yes, the existing tests cover the functionality:
   - `pruned_partition_list` tests validate the file listing behavior
   - WASM tests ensure compatibility with the WebAssembly target
   - CI runs include both native and WASM build tests
   
   ## Are there any user-facing changes?
   
   No user-facing API changes. This is an internal performance optimization 
that maintains the same external behavior while improving file listing 
performance for tables with multiple paths.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

