Hello All,

I’d like to start a discussion around adding asynchronous capability to
Spark readers by making them able to run scan tasks in parallel, especially
when large numbers of small files are involved.
Currently, readers are built around BaseReader.next(), where each task is
opened, fully consumed, and closed before moving on to the next one.

With workloads containing hundreds or thousands of small files (for
example, 4–10 KB each), this sequential behavior can introduce significant
overhead. Each file is opened independently, and the reader waits for one
task to be fully consumed before opening the next, which can also leave the
CPU largely idle between files.

One possible improvement is to optionally allow Spark readers to function
asynchronously for scans dominated by many small files.
At a high level, the idea would be to:

   - Open multiple small-file scan tasks concurrently
   - Read from them asynchronously or in parallel
   - Stitch their output into a single buffered iterator or stream for
   downstream processing (a rough sketch follows below)

The existing sequential behavior would remain the default, with this mode
being opt-in or conditionally enabled for small-file-heavy workloads.
This could benefit several Iceberg use cases, such as compaction and
cleanup jobs.
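
To make this a bit more concrete, below is a minimal, self-contained
sketch of the "open concurrently, then stitch" idea. It is only an
illustration under stated assumptions: AsyncSmallFileReader and
SmallScanTask are hypothetical names, not existing Iceberg or Spark
classes, and a real implementation would need to bound the number of
in-flight tasks, stream rows instead of buffering whole files, and handle
errors and cancellation.

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch only: SmallScanTask stands in for a scan task over
// one small file; it is not a real Iceberg or Spark API.
public class AsyncSmallFileReader<T> {

  /** Placeholder for one small-file scan task: open, fully consume, close. */
  public interface SmallScanTask<R> {
    List<R> readAll();
  }

  private final ExecutorService pool;

  public AsyncSmallFileReader(int parallelism) {
    this.pool = Executors.newFixedThreadPool(parallelism);
  }

  /**
   * Submits every task to the pool up front, then stitches the completed
   * results back into a single iterator in task order. Buffering whole
   * files is acceptable here only because the files are tiny (4-10 KB).
   */
  public Iterator<T> read(List<SmallScanTask<T>> tasks) {
    List<CompletableFuture<List<T>>> futures = new ArrayList<>();
    for (SmallScanTask<T> task : tasks) {
      futures.add(CompletableFuture.supplyAsync(task::readAll, pool));
    }
    List<T> buffered = new ArrayList<>();
    for (CompletableFuture<List<T>> future : futures) {
      buffered.addAll(future.join()); // blocks only on still-running tasks
    }
    return buffered.iterator();
  }

  public void close() {
    pool.shutdown();
  }

  // Tiny usage example with dummy in-memory "files".
  public static void main(String[] args) {
    AsyncSmallFileReader<String> reader = new AsyncSmallFileReader<>(4);
    List<SmallScanTask<String>> tasks = List.of(
        () -> List.of("row-a1", "row-a2"),
        () -> List.of("row-b1"),
        () -> List.of("row-c1", "row-c2", "row-c3"));
    reader.read(tasks).forEachRemaining(System.out::println);
    reader.close();
  }
}

The main design question this glosses over is where the parallelism lives:
inside a single Spark task (as sketched) versus relying on Spark to split
the work across more tasks, which is part of what I'd like feedback on.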

*My Questions*

   - Are there known constraints in Spark’s task execution model that would
   make this approach problematic?
   - Would it be reasonable for me to put together a proposal for this idea
   and start working on it?

I’ve opened a related issue [1] to capture the problem statement and
initial thoughts. Any feedback, pointers to prior discussions, or guidance
on how to proceed would be very helpful.

[1] Github issue - https://github.com/apache/iceberg/issues/15287
--
Lakhyani Varun
Indian Institute of Technology Roorkee
Contact: +91 96246 46174