I don't think Spark can do this with its current architecture. It has to
wait for a stage to finish before its child stages can start, so this kind
of speculative, pipelined execution isn't possible. Others probably know
more about why that is.
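For reference, here is a minimal sketch (mine, not from the thread; the
object name, local master, and input path are placeholders) of the behavior
in question: the shuffle introduced by reduceByKey splits the job into two
stages, and with today's scheduler none of the reduce-side tasks are
launched until every map-side task has finished.

import org.apache.spark.sql.SparkSession

object StageBarrierSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("stage-barrier-sketch")
      .master("local[*]")                  // assumption: local run, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Stage 1 (map side): read and key the records.
    val pairs = sc.textFile("input.txt")   // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // Stage 2 (reduce side): reduceByKey forces a shuffle, so these tasks
    // are only scheduled after the entire map stage above has completed.
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}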

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com


On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:

> Hello Spark users,
>
> I have a question on the architecture of Spark (which could lead to a
> research problem). In its current implementation, Spark finishes executing
> all the tasks in a stage before proceeding to child stages. For example,
> given a two-stage map-reduce DAG, Spark finishes executing all the map
> tasks before scheduling reduce tasks.
>
> We can think of another 'pipelined execution' strategy in which tasks in
> child stages can be scheduled and executed concurrently with tasks in
> parent stages. For example, for the two-stage map-reduce DAG, while map
> tasks are being executed, we could schedule and execute reduce tasks in
> advance if the cluster has enough resources. These reduce tasks can also
> pre-fetch the output of map tasks.
>
> Has anyone seen Spark jobs for which this 'pipelined execution' strategy
> would be desirable and for which the current implementation is not quite
> adequate? Since Spark tasks usually run for a short period of time, I guess
> the new strategy would not yield a major performance improvement. However,
> there might be some category of Spark jobs for which this new strategy
> would clearly be a better choice.
>
> Thanks,
>
> --- Sungwoo
>
>
