You want to talk to Chris Wensel, creator of Cascading, a system that did
speculative execution for a large volume of enterprise workloads. It was
the first approachable way to scale workloads using Hadoop. He could write
a book about this topic. Happy to introduce you if you’d like, or you could
ask on the Cascading user group.

https://cascading.wensel.net/

On Wed, Sep 7, 2022 at 3:49 PM Sungwoo Park <glap...@gmail.com> wrote:

> You are right -- Spark can't do this with its current architecture. My
> question was: if there were a new implementation supporting pipelined
> execution, what kind of Spark jobs would benefit (a lot) from it?
>
> Thanks,
>
> --- Sungwoo
>
> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney <russell.jur...@gmail.com>
> wrote:
>
>> I don't think Spark can do this with its current architecture. It has to
>> wait for the stage to be done; speculative execution isn't possible. Others
>> probably know more about why that is.
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com
>>
>>
>> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:
>>
>>> Hello Spark users,
>>>
>>> I have a question on the architecture of Spark (which could lead to a
>>> research problem). In its current implementation, Spark finishes executing
>>> all the tasks in a stage before proceeding to child stages. For example,
>>> given a two-stage map-reduce DAG, Spark finishes executing all the map
>>> tasks before scheduling reduce tasks.
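>>>
>>> For a concrete picture, here is a minimal two-stage word count in Scala
>>> (paths and names are made up for illustration); the reduceByKey call
>>> introduces the shuffle boundary, so under the current scheduler no reduce
>>> task starts until every map task has written its shuffle output:
>>>
>>> import org.apache.spark.sql.SparkSession
>>>
>>> object TwoStageExample {
>>>   def main(args: Array[String]): Unit = {
>>>     val spark = SparkSession.builder().appName("two-stage-example").getOrCreate()
>>>     val sc = spark.sparkContext
>>>
>>>     // Stage 1 (map side): parse and key the records.
>>>     val pairs = sc.textFile("hdfs:///tmp/input")      // hypothetical input path
>>>       .flatMap(_.split("\\s+"))
>>>       .map(word => (word, 1L))
>>>
>>>     // reduceByKey forces a shuffle, so stage 2 (reduce side) is scheduled
>>>     // only after every task of stage 1 has finished.
>>>     val counts = pairs.reduceByKey(_ + _)
>>>
>>>     counts.saveAsTextFile("hdfs:///tmp/output")       // hypothetical output path
>>>     spark.stop()
>>>   }
>>> }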
>>>
>>> We can think of another 'pipelined execution' strategy in which tasks in
>>> child stages can be scheduled and executed concurrently with tasks in
>>> parent stages. For example, for the two-stage map-reduce DAG, while map
>>> tasks are being executed, we could schedule and execute reduce tasks in
>>> advance if the cluster has enough resources. These reduce tasks can also
>>> pre-fetch the output of map tasks.
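>>>
>>> To make the intent concrete, here is a toy sketch in plain Scala (not
>>> something Spark's scheduler supports today): the "reduce side" starts
>>> folding over map outputs as individual "map tasks" publish them, instead
>>> of waiting for the whole map stage. The queue stands in for shuffle
>>> pre-fetching; all names and data are made up for illustration.
>>>
>>> import java.util.concurrent.LinkedBlockingQueue
>>> import scala.concurrent.{Await, Future}
>>> import scala.concurrent.ExecutionContext.Implicits.global
>>> import scala.concurrent.duration._
>>>
>>> object PipelinedSketch {
>>>   def main(args: Array[String]): Unit = {
>>>     // Map output records flow through this queue as soon as each "map task" emits them.
>>>     val mapOutputs = new LinkedBlockingQueue[Option[(String, Long)]]()
>>>
>>>     // Map side: publishes each record immediately rather than at stage end.
>>>     val mapSide = Future {
>>>       val inputs = Seq("a", "b", "a", "c", "b", "a")  // made-up input
>>>       inputs.foreach { word =>
>>>         Thread.sleep(100)                             // simulate map work
>>>         mapOutputs.put(Some((word, 1L)))
>>>       }
>>>       mapOutputs.put(None)                            // end-of-stream marker
>>>     }
>>>
>>>     // Reduce side: starts aggregating right away, overlapping with the map side.
>>>     val reduceSide = Future {
>>>       val counts = scala.collection.mutable.Map.empty[String, Long].withDefaultValue(0L)
>>>       var done = false
>>>       while (!done) {
>>>         mapOutputs.take() match {
>>>           case Some((word, n)) => counts(word) += n
>>>           case None            => done = true
>>>         }
>>>       }
>>>       counts.toMap
>>>     }
>>>
>>>     Await.result(mapSide, 10.seconds)
>>>     println(Await.result(reduceSide, 10.seconds))
>>>   }
>>> }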
>>>
>>> Has anyone seen Spark jobs for which this 'pipelined execution' strategy
>>> would be desirable while the current implementation is not quite adequate?
>>> Since Spark tasks usually run for a short period of time, I guess the new
>>> strategy would not yield a major performance improvement. However, there
>>> might be some category of Spark jobs for which this new strategy would
>>> clearly be a better choice.
>>>
>>> Thanks,
>>>
>>> --- Sungwoo
>>>

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com
