Whenever you need to do a shuffle-based operation like reduceByKey,
groupByKey, join, etc., the system is essentially redistributing the data
across the cluster, and it needs to know how many partitions to divide the
data into. That's where the default parallelism is used.

TD


On Fri, Jul 11, 2014 at 3:16 AM, M Singh <mans6si...@yahoo.com> wrote:

> Hi TD:
>
> The input file is on hdfs.
>
>  The file is approx 2.7 GB, and when the process starts, there are 11
> tasks (since the hdfs block size is 256M) for processing and 2 tasks for
> reduce by key.  After the file has been processed, I see new stages with 2
> tasks that continue to be generated. I understand this value (2) is the
> default value for spark.default.parallelism, but I don't quite understand
> how the value is determined for generating tasks for reduceByKey, how it is
> used besides reduceByKey, and what the optimal value for it should be.
>
>  Thanks.
>
>
>   On Thursday, July 10, 2014 7:24 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>
> How are you supplying the text file?
>
>
> On Wed, Jul 9, 2014 at 11:51 AM, M Singh <mans6si...@yahoo.com> wrote:
>
> Hi Folks:
>
> I am working on an application which uses spark streaming (version 1.1.0
> snapshot on a standalone cluster) to process a text file and save counters
> in cassandra based on fields in each row.  I am testing the application in
> two modes:
>
>    - Process each row and save the counter in cassandra.  In this
>    scenario, after the text file has been consumed, there are no
>    tasks/stages seen in the spark UI.
>    - If instead I use reduce by key before saving to cassandra, the spark
>    UI shows continuous generation of tasks/stages even after the file has
>    been processed.
>
> I believe this is because the reduce by key requires merging of data from
> different partitions.  But I was wondering if anyone has any
> insights/pointers for understanding this difference in behavior and how to
> avoid generating tasks/stages when there is no data (new file) available.
>
> Thanks
>
> Mans
>
>
>
>
>
