Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-14 Thread Tathagata Das
That depends on your requirements. If you want to process the 250 GB input
file as a stream, to emulate a stream of data, then it should be split into
files (such that event ordering is maintained across those splits, if
necessary). Those splits should then be moved one-by-one into the directory
monitored by the streaming app. You will need to figure out the split size,
etc., depending on your intended batch size (in terms of seconds) in the
streaming app.
And it doesn't really need to be a multiple of the HDFS block size.
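
For example, here is a rough Scala sketch of the move step, assuming the
splits were already written under /data/staging, the app monitors
/data/incoming, and the batch interval is 10 seconds (all of these names are
made up for illustration):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  object MoveSplits {
    def main(args: Array[String]): Unit = {
      val fs = FileSystem.get(new Configuration())
      val staging  = new Path("/data/staging")   // where the splits were written
      val incoming = new Path("/data/incoming")  // directory monitored by the streaming app

      // Move one split per batch interval. rename() is atomic on HDFS, so each
      // file appears in the monitored directory all at once.
      fs.listStatus(staging).map(_.getPath).sortBy(_.getName).foreach { split =>
        fs.rename(split, new Path(incoming, split.getName))
        Thread.sleep(10000L)  // assumed 10-second batch interval
      }
    }
  }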

TD



Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-12 Thread M Singh
Thanks TD.

BTW - if I have an input file of ~250 GB, is there any guideline on whether to use:

* a single input file (250 GB) (in this case, is there any maximum upper bound), or
* a split into 1000 files of 250 MB each (the HDFS block size is 250 MB), or
* a multiple of the HDFS block size?

Mans





Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread M Singh
Hi TD:

The input file is on HDFS.

The file is approx 2.7 GB, and when the process starts there are 11 tasks
(since the HDFS block size is 256 MB) for processing and 2 tasks for reduce by
key. After the file has been processed, I see new stages with 2 tasks that
continue to be generated. I understand this value (2) is the default value for
spark.default.parallelism, but I don't quite understand how the value is
determined for generating tasks for reduceByKey, how it is used besides
reduceByKey, and what the optimal value for it should be.

Thanks.


On Thursday, July 10, 2014 7:24 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

How are you supplying the text file?




Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
Whenever you do a shuffle-based operation like reduceByKey, groupByKey, join,
etc., the system is essentially redistributing the data across the cluster,
and it needs to know how many parts it should divide the data into. That's
where the default parallelism is used.
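
For example (a rough sketch; the stream, the paths, and the value 16 are just
placeholders):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.StreamingContext._
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("counters")
    .set("spark.default.parallelism", "16")  // used when no partition count is given

  val ssc = new StreamingContext(conf, Seconds(10))
  val pairs = ssc.textFileStream("/data/incoming")
                 .map(line => (line.split(",")(0), 1L))

  // An explicit partition count overrides spark.default.parallelism for this shuffle.
  val counts = pairs.reduceByKey(_ + _, 16)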

TD


Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread M Singh
So, is it expected for the process to generate stages/tasks even after
processing a file?

Also, is there a way to figure out which file is being processed and when that
processing is complete?

Thanks



Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread Tathagata Das
The model for the file stream is to pick up and process new files written
atomically (by move) into a directory. So your file is being processed in a
single batch, and then it is waiting for any new files to be written into
that directory.
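
So the pattern is: write the file somewhere outside the monitored directory
first, then rename it in, since rename is atomic on HDFS while an in-place
write is not. Roughly (paths made up):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(new Configuration())
  // Write the file outside the monitored directory first...
  val tmp = new Path("/data/tmp/part-0001")
  // ...then move it in atomically, so the stream never sees a half-written file.
  fs.rename(tmp, new Path("/data/incoming/part-0001"))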

TD


Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-09 Thread M Singh
Hi Folks:


I am working on an application which uses Spark Streaming (version 1.1.0
snapshot on a standalone cluster) to process a text file and save counters in
Cassandra based on fields in each row. I am testing the application in two
modes:

* Process each row and save the counter in Cassandra. In this scenario, after
the text file has been consumed, there are no tasks/stages seen in the Spark
UI.

* If instead I use reduce by key before saving to Cassandra, the Spark UI
shows continuous generation of tasks/stages even after processing of the file
has been completed.

I believe this is because reduce by key requires merging data from different
partitions. But I was wondering if anyone has any insights/pointers for
understanding this difference in behavior, and how to avoid generating
tasks/stages when there is no data (no new file) available.
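
For reference, a stripped-down sketch of the second mode (the Cassandra write
is stubbed out with a println, and all names are placeholders rather than the
actual application code):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.StreamingContext._
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("counters")
  val ssc = new StreamingContext(conf, Seconds(10))

  // Files moved into this directory are picked up, each as part of one batch.
  val lines = ssc.textFileStream("/data/incoming")

  // Mode 2: reduce by key before saving. reduceByKey shuffles, so every batch
  // interval (even one with no new files) schedules stages/tasks for the shuffle.
  val counters = lines.map(row => (row.split(",")(0), 1L)).reduceByKey(_ + _)

  counters.foreachRDD { rdd =>
    rdd.foreach { case (key, count) =>
      println(s"$key -> $count")  // stand-in for the Cassandra counter update
    }
  }

  ssc.start()
  ssc.awaitTermination()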

Thanks

Mans