Update on this - I have proposed this as a GSoC 2026 idea. I ran a couple of benchmarks:
Mode            Time (s/op)
Async (true)    78.319
Sync (false)    301.904

Metric           Async (true)   Sync (false)
CPU Time         12.4 s         14.5 s
Peak Heap        425.4 MB       418.3 MB
Total Allocated  8.5 MB/iter    8.9 MB/iter
GC Time          70.7 ms        53.5 ms
GC Count         14.1           11.2

This shows a ~74% reduction in execution time ((301.904 - 78.319) / 301.904 ≈ 74%), with similar memory usage and slightly higher GC activity.

Next, I plan to explore a more targeted async approach (e.g., parallelizing only the network calls instead of the entire open function), and to study parallel implementations in other projects such as DataFusion Comet.

On Thu, Feb 12, 2026 at 12:44 PM Varun Lakhyani <[email protected]> wrote:
> Hello All,
>
> I’d like to start a discussion around adding asynchronous capability to
> Spark readers by making them capable of running parallel tasks, especially
> when large numbers of small files are involved.
> Currently, readers are based on BaseReader.next(), where each task is
> opened, fully consumed, and closed before moving on to the next one.
>
> With workloads containing hundreds or thousands of small files (for
> example, 4–10 KB files), this sequential behavior can introduce significant
> overhead. Each file is opened independently, and the reader waits for one
> task to be fully consumed before opening the next. CPU idleness can also
> be a major issue here.
>
> One possible improvement is to optionally allow Spark readers to function
> asynchronously for scans dominated by many small files.
> At a high level, the idea would be to:
>
> - Open multiple small-file scan tasks concurrently, read from them
> asynchronously or in parallel, and stitch their output into a single
> buffered iterator or stream for downstream processing
>
> The existing sequential behavior would remain the default, with this mode
> being opt-in or conditionally enabled for small-file-heavy workloads.
> This could benefit several Iceberg use cases, including compaction and
> cleanup jobs.
> *My Question*
>
> - Are there known constraints in Spark’s task execution model that
> would make this approach problematic?
> - Would it be appropriate for me to draft a proposal around this idea
> and work on it?
>
> I’ve opened a related issue [1] to capture the problem statement and
> initial thoughts.
> Any feedback, pointers to prior discussions, or guidance would be very
> helpful.
>
> [1] GitHub issue - https://github.com/apache/iceberg/issues/15287
> --
> Lakhyani Varun
> Indian Institute of Technology Roorkee
> Contact: +91 96246 46174
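For reference, the "open multiple scan tasks concurrently and stitch their output into a single buffered iterator" idea from the proposal can be sketched roughly as below. This is a minimal illustration using plain java.util.concurrent, not Iceberg APIs: AsyncScanSketch, openAndRead, and the String rows are all hypothetical stand-ins, and a real change would plug into BaseReader and the actual scan-task types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncScanSketch {

    // Stand-in for a small-file scan task: opening and fully reading one
    // file. Here it just fabricates three rows per "file".
    static List<String> openAndRead(String file) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            rows.add(file + ":row" + i);
        }
        return rows;
    }

    // Submit all file-read tasks to a thread pool so they open/read
    // concurrently, then buffer the results and expose one iterator
    // for downstream (sequential) consumption.
    static Iterator<String> asyncIterator(List<String> files, int parallelism)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String f : files) {
                futures.add(pool.submit(() -> openAndRead(f)));
            }
            // Preserve input file order while still overlapping the reads.
            List<String> buffered = new ArrayList<>();
            for (Future<List<String>> fut : futures) {
                buffered.addAll(fut.get());
            }
            return buffered.iterator();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Iterator<String> it = asyncIterator(Arrays.asList("a", "b"), 2);
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        System.out.println(count + " rows");
    }
}
```

A production version would want bounded buffering (e.g. a BlockingQueue rather than collecting everything eagerly) to keep peak heap flat, which matters given the memory numbers above.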
