I could be wrong, but… just start it. On large datasets, reducing the entire
dataset takes a lot of time. If you have the resources, start combining and
reducing on partial map results. As soon as you've got one record out of the
map, it already has a reduce key in the plan, so send it to that reducer. You
can't finish the reduce until the map is done, but you can start it
immediately. This depends on the reducers being algebraic (associative and
commutative), of course. Learning to think in MapReduce isn't even possible
for a lot of people; some say it's impossible to do well, but I disagree :)
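
To make "algebraic" concrete, here's a minimal sketch in Scala (the toy
word-count data and names are just illustration, not from this thread).
Because the reduce function is associative and commutative, Spark's
reduceByKey can already fold partial map output together on the map side
before the shuffle -- the same property that would let a pipelined engine
start reducing before the map stage finishes:

import org.apache.spark.sql.SparkSession

object AlgebraicReduceSketch {
  def main(args: Array[String]): Unit = {
    // Local session purely for illustration.
    val spark = SparkSession.builder()
      .appName("algebraic-reduce-sketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy input standing in for records coming out of the map.
    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    // Each map output record already carries its reduce key.
    val pairs = words.map(w => (w, 1))

    // _ + _ is associative and commutative (algebraic), so Spark can
    // combine partial results per partition before shuffling them,
    // instead of waiting to see every record for a key in one place.
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach(println)
    spark.stop()
  }
}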

On Wed, Sep 7, 2022 at 3:51 PM Sean Owen <sro...@gmail.com> wrote:

> Wait, how do you start reduce tasks before maps are finished? Is the idea
> that some reduce tasks don't depend on all the maps, or at least you can
> get started?
> You can already execute unrelated DAGs in parallel of course.
>
> On Wed, Sep 7, 2022 at 5:49 PM Sungwoo Park <glap...@gmail.com> wrote:
>
>> You are right -- Spark can't do this with its current architecture. My
>> question was: if there were a new implementation supporting pipelined
>> execution, what kind of Spark jobs would benefit (a lot) from it?
>>
>> Thanks,
>>
>> --- Sungwoo
>>
>> On Thu, Sep 8, 2022 at 1:47 AM Russell Jurney <russell.jur...@gmail.com>
>> wrote:
>>
>>> I don't think Spark can do this with its current architecture. It has to
>>> wait for the stage to be done; speculative execution isn't possible. Others
>>> probably know more about why that is.
>>>
>>> Thanks,
>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>> <http://facebook.com/jurney> datasyndrome.com
>>>
>>>
>>> On Wed, Sep 7, 2022 at 7:42 AM Sungwoo Park <glap...@gmail.com> wrote:
>>>
>>>> Hello Spark users,
>>>>
>>>> I have a question on the architecture of Spark (which could lead to a
>>>> research problem). In its current implementation, Spark finishes executing
>>>> all the tasks in a stage before proceeding to child stages. For example,
>>>> given a two-stage map-reduce DAG, Spark finishes executing all the map
>>>> tasks before scheduling reduce tasks.
>>>>
>>>> We can think of another 'pipelined execution' strategy in which tasks
>>>> in child stages can be scheduled and executed concurrently with tasks in
>>>> parent stages. For example, for the two-stage map-reduce DAG, while map
>>>> tasks are being executed, we could schedule and execute reduce tasks in
>>>> advance if the cluster has enough resources. These reduce tasks can also
>>>> pre-fetch the output of map tasks.
>>>>
>>>> Has anyone seen Spark jobs for which this 'pipelined execution' strategy
>>>> would be desirable and the current implementation is not quite adequate?
>>>> Since Spark tasks usually run for a short period of time, I suspect the
>>>> new strategy would not bring a major performance improvement on its own.
>>>> However, there might be some category of Spark jobs for which the new
>>>> strategy would clearly be the better choice.
>>>>
>>>> Thanks,
>>>>
>>>> --- Sungwoo
>>>>
>>>> --

Thanks,
Russell Jurney @rjurney <http://twitter.com/rjurney>
russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
<http://facebook.com/jurney> datasyndrome.com
