Update on this - I have proposed this as a GSoC 2026 idea. I ran a couple of benchmarks:
Mode            Time (s/op)
Async (true)    78.319
Sync (false)    301.904

Metric           Async (true)   Sync (false)
CPU Time         12.4 s         14.5 s
Peak Heap        425.4 MB       418.3 MB
Total Allocated  8.5 MB/iter    8.9 MB/iter
GC Time          70.7 ms        53.5 ms
GC Count         14.1           11.2

This shows a ~74% reduction in execution time ((301.904 - 78.319) / 301.904 ≈ 74%), with similar memory usage and slightly higher GC activity.

Next, I plan to explore a more targeted async approach (e.g., parallelizing only the network calls instead of the entire open function), and to study parallel implementations in other projects such as DataFusion Comet.

On Thu, Feb 12, 2026 at 12:44 PM Varun Lakhyani <[email protected]> wrote:
> Hello All,
>
> I’d like to start a discussion around adding asynchronous capability to
> Spark readers by making them capable of running parallel tasks, especially
> when large numbers of small files are involved.
> Currently, readers are based on BaseReader.next(), where each task is
> opened, fully consumed, and closed before moving on to the next one.
>
> With workloads containing hundreds or thousands of small files (for
> example, 4–10 KB files), this sequential behavior can introduce significant
> overhead. Each file is opened independently, and the reader waits for one
> task to be fully consumed before opening the next. CPU idleness can also
> be a major issue here.
>
> One possible improvement is to optionally allow Spark readers to function
> asynchronously for scans dominated by many small files.
> At a high level, the idea would be to:
>
> - Open multiple small-file scan tasks concurrently, read from them
> asynchronously or in parallel, and stitch their output into a single
> buffered iterator or stream for downstream processing
>
> The existing sequential behavior would remain the default, with this mode
> being opt-in or conditionally enabled for small-file-heavy workloads.
> This could benefit several Iceberg use cases, including compaction and
> cleanup jobs.
> *My Question*
>
> - Are there known constraints in Spark’s task execution model that
> would make this approach problematic?
> - Would it be appropriate for me to draft a proposal around this idea
> and work on it?
>
> I’ve opened a related issue [1] to capture the problem statement and
> initial thoughts.
> Any feedback, pointers to prior discussions, or guidance would be very
> helpful.
>
> [1] GitHub issue - https://github.com/apache/iceberg/issues/15287
> --
> Lakhyani Varun
> Indian Institute of Technology Roorkee
> Contact: +91 96246 46174
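For reference, the "open multiple scan tasks concurrently and stitch their output into a single buffered iterator" idea from the proposal can be sketched roughly as below. This is a minimal illustration using plain java.util.concurrent, not Iceberg APIs: AsyncScanSketch, openAndRead, and the String rows are all hypothetical stand-ins, and a real change would plug into BaseReader and the actual scan-task types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncScanSketch {

    // Stand-in for a small-file scan task: opening and fully reading one
    // file. Here it just fabricates three rows per "file".
    static List<String> openAndRead(String file) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            rows.add(file + ":row" + i);
        }
        return rows;
    }

    // Submit all file-read tasks to a thread pool so they open/read
    // concurrently, then buffer the results and expose one iterator
    // for downstream (sequential) consumption.
    static Iterator<String> asyncIterator(List<String> files, int parallelism)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(parallelism);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String f : files) {
                futures.add(pool.submit(() -> openAndRead(f)));
            }
            // Preserve input file order while still overlapping the reads.
            List<String> buffered = new ArrayList<>();
            for (Future<List<String>> fut : futures) {
                buffered.addAll(fut.get());
            }
            return buffered.iterator();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Iterator<String> it = asyncIterator(Arrays.asList("a", "b"), 2);
        int count = 0;
        while (it.hasNext()) {
            it.next();
            count++;
        }
        System.out.println(count + " rows");
    }
}
```

A production version would want bounded buffering (e.g. a BlockingQueue rather than collecting everything eagerly) to keep peak heap flat, which matters given the memory numbers above.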
