You want to talk to Chris Wensel, creator of Cascading, a system that did speculative execution for a large volume of enterprise workloads. It was the first approachable way to scale workloads using Hadoop. He could write a book about this topic. Happy to introduce you if you'd like, or you could ask on the Cascading user group.
https://cascading.wensel.net/

On Wed, Sep 7, 2022 at 3:49 PM Sungwoo Park <glap...@gmail.com> wrote:

> You are right -- Spark can't do this with its current architecture. My
> question was: if there were a new implementation supporting pipelined
> execution, what kind of Spark jobs would benefit (a lot) from it?
>
> Thanks,
>
> --- Sungwoo
>
> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney <russell.jur...@gmail.com> wrote:
>
>> I don't think Spark can do this with its current architecture. It has to
>> wait for the stage to be done; speculative execution isn't possible. Others
>> probably know more about why that is.
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com
>>
>> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:
>>
>>> Hello Spark users,
>>>
>>> I have a question about the architecture of Spark (which could lead to a
>>> research problem). In its current implementation, Spark finishes executing
>>> all the tasks in a stage before proceeding to child stages. For example,
>>> given a two-stage map-reduce DAG, Spark finishes executing all the map
>>> tasks before scheduling any reduce tasks.
>>>
>>> We can imagine another 'pipelined execution' strategy in which tasks in
>>> child stages are scheduled and executed concurrently with tasks in
>>> parent stages. For example, in the two-stage map-reduce DAG, while map
>>> tasks are still executing, we could schedule and execute reduce tasks in
>>> advance if the cluster has enough resources. These reduce tasks could
>>> also pre-fetch the output of completed map tasks.
>>>
>>> Has anyone seen Spark jobs for which this 'pipelined execution' strategy
>>> would be desirable and the current implementation is not quite adequate?
>>> Since Spark tasks usually run for a short period of time, I suspect the
>>> new strategy would not yield a major performance improvement in general.
>>> However, there might be some category of Spark jobs for which it would
>>> clearly be the better choice.
>>>
>>> Thanks,
>>>
>>> --- Sungwoo

--
Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com
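
[For readers following the thread, here is a minimal sketch in Scala of the two-stage map-reduce DAG Sungwoo describes. The input path "input.txt" and the comma-separated (key, value) record format are illustrative assumptions, not from the thread. The point is that the shuffle introduced by reduceByKey marks the stage boundary at which Spark's current scheduler waits for all upstream tasks to finish.]

// Minimal two-stage Spark job; file path and record format are hypothetical.
import org.apache.spark.sql.SparkSession

object TwoStageExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("two-stage-dag")
      .master("local[*]") // local run, for illustration only
      .getOrCreate()
    val sc = spark.sparkContext

    // Stage 1 (map side): parse each line into a (key, value) pair.
    val pairs = sc.textFile("input.txt") // hypothetical input path
      .map { line =>
        val Array(k, v) = line.split(",", 2)
        (k, v.toLong)
      }

    // Stage 2 (reduce side): reduceByKey forces a shuffle, so under the
    // current scheduler none of these tasks start until every Stage 1
    // task has written its shuffle output. The 'pipelined execution'
    // strategy in the thread would instead schedule these tasks early
    // and let them pre-fetch map output as it becomes available.
    val totals = pairs.reduceByKey(_ + _)

    totals.collect().foreach(println)
    spark.stop()
  }
}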