I tried the ParallelIterable + ThreadPools.getWorkerPool approach and ran the benchmark again.
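For anyone skimming without opening the PR, this is roughly the shape the change takes. It is a minimal sketch of the idea, not the PR code itself: openTaskAsRows is a hypothetical hook standing in for the per-task open that BaseReader already does today, and the sketch assumes it stays lazy so file opens actually happen on the worker pool.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.concurrent.ExecutorService;

  import org.apache.iceberg.FileScanTask;
  import org.apache.iceberg.io.CloseableIterable;
  import org.apache.iceberg.util.ParallelIterable;
  import org.apache.iceberg.util.ThreadPools;

  public abstract class ParallelTaskReaderSketch<T> {

    // Hypothetical hook: wrap one scan task as a lazy row iterable.
    // It should not open the file until iterator() is called, so the
    // actual open runs on the worker pool, not on the calling thread.
    protected abstract CloseableIterable<T> openTaskAsRows(FileScanTask task);

    // Instead of opening, draining, and closing each task sequentially,
    // hand all per-task iterables to ParallelIterable, which reads them
    // on the shared worker pool and merges the results into a single
    // iterator for downstream processing.
    public CloseableIterable<T> openAll(List<FileScanTask> tasks) {
      ExecutorService workerPool = ThreadPools.getWorkerPool();
      List<CloseableIterable<T>> perTask = new ArrayList<>(tasks.size());
      for (FileScanTask task : tasks) {
        perTask.add(openTaskAsRows(task));
      }
      return new ParallelIterable<>(perTask, workerPool);
    }
  }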
Results for 1000 files, 15-20 KB each:

Overhead (ms)        Async (s)   Sync (s) (existing)   % Improvement
No manual overhead   0.855       0.842                  -1.5%
1                    0.863       2.674                  67.7%
5                    1.172       9.270                  87.4%
10                   1.670       15.629                 89.3%
15                   2.145       21.100                 89.8%

The only concern is that with no manual overhead, the async path actually
takes slightly longer than sync because of the cost of managing threads in
this implementation. But as I/O overhead increases (which is common for
cloud connections), the improvement is nearly 90%.

PR: https://github.com/apache/iceberg/pull/15341
Detailed benchmarking: https://docs.google.com/document/d/17vBz5t-gSDdmB0S40MYRceyvmcBSzw9Gii-FcU97Lds/edit?usp=sharing

On Thu, Feb 12, 2026 at 12:44 PM Varun Lakhyani <[email protected]> wrote:

> Hello All,
>
> I’d like to start a discussion around adding asynchronous capability to
> Spark readers by making them capable of running parallel tasks, especially
> when large numbers of small files are involved.
> Currently, readers are based on BaseReader.next(), where each task is
> opened, fully consumed, and closed before moving on to the next one.
>
> With workloads containing hundreds or thousands of small files (for
> example, 4–10 KB files), this sequential behavior can introduce
> significant overhead. Each file is opened independently, and the reader
> waits for one task to be fully consumed before opening the next. CPU
> idleness can also be a major issue here.
>
> One possible improvement is to optionally allow Spark readers to operate
> asynchronously for scans dominated by many small files.
> At a high level, the idea would be to:
>
> - Open multiple small-file scan tasks concurrently, read from them
> asynchronously or in parallel, and stitch their output into a single
> buffered iterator or stream for downstream processing
>
> The existing sequential behavior would remain the default, with this mode
> being opt-in or conditionally enabled for small-file-heavy workloads.
> This could benefit several Iceberg use cases, including compaction and
> cleanup jobs.
>
> *My Question*
>
> - Are there known constraints in Spark’s task execution model that
> would make this approach problematic?
> - Would it be suitable for me to write a proposal for this idea and work
> on it?
>
> I’ve opened a related issue [1] to capture the problem statement and
> initial thoughts. Any feedback, pointers to prior discussions, or guidance
> would be very helpful.
>
> [1] GitHub issue - https://github.com/apache/iceberg/issues/15287
> --
> Lakhyani Varun
> Indian Institute of Technology Roorkee
> Contact: +91 96246 46174
