Re: [DISCUSS] Exploring parallel task execution in Spark readers

Steve Loughran Thu, 12 Feb 2026 07:04:38 -0800

For an object store, overlapping the GET of the next file with the
processing of the current one would maximise CPU use. There'd be no
conflicting demand for the core: just an HTTP request issued and awaiting a
response on one thread while the core carries on with its work.


Calling InputFile.newStream() asynchronously would be enough to start,
though any cloud connector that issues its GET calls lazily would postpone
all IO until the first read takes place...
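
A rough sketch of that overlap, in plain Java rather than against the real
Iceberg classes. StreamSource is a hypothetical stand-in for anything with a
newStream() method, and PrefetchSketch, countBytes and openAsync are made-up
names. The one-byte read inside the async task is an assumption about how one
might force a lazily-opening connector to issue its GET early. It is not
something any connector is known to require or document.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Hypothetical stand-in for a file handle whose newStream() may block on a remote GET.
interface StreamSource {
    InputStream newStream() throws IOException;
}

final class PrefetchSketch {

    // Consumes the sources in order. While source i is being read on the calling
    // thread, source i+1 is already being opened on ioPool.
    static long countBytes(List<StreamSource> sources, ExecutorService ioPool) {
        long total = 0;
        CompletableFuture<InputStream> next =
            sources.isEmpty() ? null : openAsync(sources.get(0), ioPool);
        for (int i = 0; i < sources.size(); i++) {
            CompletableFuture<InputStream> current = next;
            // Start opening the next source before touching the current stream.
            next = (i + 1 < sources.size()) ? openAsync(sources.get(i + 1), ioPool) : null;
            try (InputStream in = current.join()) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    total += n; // placeholder for the real per-file processing
                }
            } catch (IOException e) {
                // Note: a prefetched stream for the next source is not closed here;
                // real code would need to cancel or close it.
                throw new UncheckedIOException(e);
            }
        }
        return total;
    }

    // Opens the stream on the IO pool and reads one byte through a buffer, so that
    // a connector which defers its GET until the first read still starts the IO now.
    private static CompletableFuture<InputStream> openAsync(StreamSource source,
                                                            ExecutorService ioPool) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                InputStream in = new BufferedInputStream(source.newStream());
                in.mark(1);  // remember the start of the stream
                in.read();   // triggers the underlying read and fills the buffer
                in.reset();  // rewind so the consumer still sees every byte
                return in;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, ioPool);
    }
}
```

Only one file is prefetched at a time here, which keeps memory bounded. A
deeper prefetch queue would be the obvious next step for the 4-10 KB file
case.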

On Thu, 12 Feb 2026 at 07:14, Varun Lakhyani <[email protected]>
wrote:

> Hello All,
>
> I’d like to start a discussion around adding asynchronous capability to
> Spark readers, enabling them to run tasks in parallel, especially when
> large numbers of small files are involved.
> Currently, readers are based on BaseReader.next() where each task is
> opened, fully consumed, and closed before moving on to the next one.
>
> With workloads containing hundreds or thousands of small files (for
> example, 4–10 KB files), this sequential behavior can introduce significant
> overhead. Each file is opened independently, and the reader waits for one
> task to be fully consumed before opening the next. The resulting CPU idle
> time can also be a major issue.
>
> One possible improvement is to optionally allow Spark readers to function
> asynchronously for scans dominated by many small files.
> At a high level, the idea would be to:
>
>    - Open multiple small-file scan tasks concurrently
>    - Read from them asynchronously or in parallel
>    - Stitch their output into a single buffered iterator or stream for
>    downstream processing
>
> The existing sequential behavior would remain the default, with this mode
> being opt-in or conditionally enabled for small-file-heavy workloads.
> This could benefit several Iceberg use cases, including compaction or
> cleanup jobs.
>
> *My Questions*
>
>    - Are there known constraints in Spark’s task execution model that
>    would make this approach problematic?
>    - Would it be appropriate for me to draft a proposal for this idea and
>    work on it?
>
> I’ve opened a related issue [1] to capture the problem statement and
> initial thoughts.
> Any feedback, pointers to prior discussions, or guidance would be very
> helpful.
>
> [1] Github issue - https://github.com/apache/iceberg/issues/15287
> --
> Lakhyani Varun
> Indian Institute of Technology Roorkee
> Contact: +91 96246 46174
>
>
