Re: Pipelined execution in Spark (???)

2022-09-11 Thread Gourav Sengupta
Hi,

for some operations such as repartitionByRange, it is indeed quite annoying
to have to wait for all the map tasks to complete before the reduce side
starts (see the sketch below).

@Sean Owen, do you have any comments?

Regards,
Gourav Sengupta
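
A minimal sketch of the case above (the column name and output path are
illustrative, not from the thread): repartitionByRange forces a full shuffle,
so the post-shuffle stage cannot be scheduled until every map task has
finished.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Illustrative sketch: range partitioning samples the data to choose range
// bounds, then shuffles every row to its target partition -- a hard stage
// boundary that all downstream work must wait on.
object RangeRepartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("range-repartition").getOrCreate()

    val df = spark.range(0L, 100000000L).toDF("id")

    val ranged = df.repartitionByRange(200, col("id")) // full shuffle here

    ranged.write.mode("overwrite").parquet("/tmp/ranged") // illustrative path
    spark.stop()
  }
}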


Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
I could be wrong, but… just start it. If you have the capacity, it takes a
lot of time on large datasets to reduce the entire dataset. If you have the
resources, start combining and reducing on partial map results. As soon as
you’ve got one record out of the map, it has a reduce key in the plan, so
send it to that reducer. You can’t finish the reduce until you’re done with
the map, but you can start it immediately. This depends on the reducers being
algebraic, of course, and learning to think in MapReduce isn’t even possible
for a lot of people. Some people say it is impossible to do this well, but I
disagree :)
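
A minimal sketch of the algebraic-reducer property this argument rests on
(the job itself is illustrative): summation is associative and commutative,
so partial results can be folded in as soon as any map output exists. Spark
already exploits the same property on the map side via combiners in
reduceByKey.

import org.apache.spark.sql.SparkSession

// Illustrative job: the reduce function (_ + _) is associative and
// commutative, so combining partial map outputs early would give the same
// final answer as waiting for the entire map stage to finish.
object AlgebraicReduceSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("algebraic-reduce").getOrCreate()
    val sc = spark.sparkContext

    val records = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L), ("b", 4L)))

    // reduceByKey pre-aggregates with the same function on the map side
    // (combiners); a pipelined reducer could keep folding fetched partial
    // sums incrementally for exactly the same reason.
    val totals = records.reduceByKey(_ + _)
    totals.collect().foreach(println)

    spark.stop()
  }
}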



Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Yes, we can get reduce tasks started when there are enough resources in the
cluster. As you point out, reduce tasks cannot produce their output while
map tasks are still running, but they can prefetch the output of map tasks.
In our prototype implementation of pipelined execution, everything works as
intended, but for typical Spark jobs (like Spark SQL jobs), we don't see a
noticeable performance improvement because Spark tasks are mostly
short-running. My question was whether there is some category of Spark jobs
that would benefit from pipelined execution (one hypothetical job shape is
sketched below).

Thanks,

--- Sungwoo
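
One hypothetical job shape that might qualify (the paths and the
expensiveFeature helper are made up for illustration): map tasks that run
long enough that reduce-side prefetching of already-finished map outputs
could overlap useful work.

import org.apache.spark.sql.SparkSession

object LongMapStageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("long-map-stage").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("hdfs:///data/large-input") // illustrative path

    // CPU-heavy per-record work keeps the map stage busy for a long time;
    // under the current scheduler, reducers stay idle until every map task
    // finishes, even though early map outputs are already fetchable.
    val keyed = lines.map { line =>
      val fields = line.split(',')
      (fields(0), expensiveFeature(fields))
    }

    keyed.reduceByKey(_ + _).saveAsTextFile("hdfs:///data/output")
    spark.stop()
  }

  // Hypothetical stand-in for expensive per-record computation.
  def expensiveFeature(fields: Array[String]): Double =
    fields.drop(1).map(f => math.log1p(f.toDouble)).sum
}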


Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
Oops, it has been a long time since Russell labored on Hadoop; speculative
execution isn’t the right term - that is something else. Cascading has a
declarative interface so you can plan more, whereas Spark is more
imperative. Point remains :)



Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
You want to talk to Chris Wensel, creator of Cascading, a system that did
speculative execution for a large volume of enterprise workloads. It was
the first approachable way to scale workloads using Hadoop. He could write
a book about this topic. Happy to introduce you if you’d like, or you could
ask on the Cascading user group.

https://cascading.wensel.net/



Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sean Owen
Wait, how do you start reduce tasks before maps are finished? Is the idea
that some reduce tasks don't depend on all the maps, or at least can get
started early? You can already execute unrelated DAGs in parallel, of
course (see the sketch below).
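
A minimal sketch of that last point (both jobs are illustrative): a single
SparkContext accepts job submissions from separate threads, so independent
DAGs already run concurrently today; the open question in this thread is
overlapping stages within one DAG.

import org.apache.spark.sql.SparkSession
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Two unrelated DAGs submitted from separate threads run in parallel,
// resources permitting; this works in Spark as it is today.
object ParallelDagsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("parallel-dags").getOrCreate()
    val sc = spark.sparkContext

    val jobA = Future { sc.parallelize(1 to 1000000).map(_.toLong * 2).sum() }
    val jobB = Future { sc.parallelize(1 to 1000000).filter(_ % 3 == 0).count() }

    println(s"sum = ${Await.result(jobA, Duration.Inf)}")
    println(s"count = ${Await.result(jobB, Duration.Inf)}")
    spark.stop()
  }
}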



Re: Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
You are right -- Spark can't do this with its current architecture. My
question was: if there were a new implementation supporting pipelined
execution, what kind of Spark jobs would benefit (a lot) from it?

Thanks,

--- Sungwoo



Re: Pipelined execution in Spark (???)

2022-09-07 Thread Russell Jurney
I don't think Spark can do this with its current architecture. It has to
wait for the step to be done; speculative execution isn't possible. Others
probably know more about why that is.





Pipelined execution in Spark (???)

2022-09-07 Thread Sungwoo Park
Hello Spark users,

I have a question on the architecture of Spark (which could lead to a
research problem). In its current implementation, Spark finishes executing
all the tasks in a stage before proceeding to child stages. For example,
given a two-stage map-reduce DAG, Spark finishes executing all the map
tasks before scheduling reduce tasks.

We can think of another 'pipelined execution' strategy in which tasks in
child stages can be scheduled and executed concurrently with tasks in
parent stages. For example, for the two-stage map-reduce DAG, while map
tasks are being executed, we could schedule and execute reduce tasks in
advance if the cluster has enough resources. These reduce tasks can also
pre-fetch the output of map tasks.

Has anyone seen Spark jobs for which this 'pipelined execution' strategy
would be desirable while the current implementation is not quite adequate?
Since Spark tasks usually run for a short period of time, I guess the new
strategy would not bring a major performance improvement. However, there
might be some category of Spark jobs for which this new strategy would
clearly be a better choice.

Thanks,

--- Sungwoo
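
A minimal two-stage map-reduce DAG of the kind described above (the job is
illustrative): the shuffle dependency introduced by reduceByKey splits the
job into a map stage and a reduce stage, and the current scheduler submits
the reduce stage only after every task of the map stage has completed.

import org.apache.spark.sql.SparkSession

object TwoStageDagSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("two-stage-dag").getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

    val counts = words
      .map(w => (w, 1))   // stage 0: map tasks write shuffle files
      .reduceByKey(_ + _) // stage boundary: reduce tasks fetch shuffle output
                          // and are scheduled only after all of stage 0 ends

    counts.collect().foreach(println)
    spark.stop()
  }
}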