[jira] [Updated] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches

2015-11-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7441:
-
Target Version/s:   (was: 1.6.0)

> Implement microbatch functionality so that Spark Streaming can process a 
> large backlog of existing files discovered in batch in smaller batches
> ---
>
> Key: SPARK-7441
> URL: https://issues.apache.org/jira/browse/SPARK-7441
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Emre Sevinç
>  Labels: performance
>
> Implement microbatch functionality so that Spark Streaming can process a huge 
> backlog of existing files, discovered in batch, in smaller batches.
> Spark Streaming can process files that already exist in a directory, and 
> depending on the value of {{spark.streaming.minRememberDuration}} (60 
> seconds by default, see SPARK-3276 for more details), this can mean that a 
> Spark Streaming application receives thousands, or hundreds of thousands, 
> of files within the first batch interval. This, in turn, leads to a 
> 'flooding' effect: the streaming application has to deal with a huge 
> number of existing files in a single batch interval.
> We propose a simple change to 
> {{org.apache.spark.streaming.dstream.FileInputDStream}} so that, based on a 
> configuration property such as {{spark.streaming.microbatch.size}}, it 
> either keeps its default behavior when {{spark.streaming.microbatch.size}} 
> has its default value of {{0}} (meaning it takes all files discovered as 
> new in the current batch interval), or processes new files in groups of 
> {{spark.streaming.microbatch.size}} (e.g. in groups of 100).
> We have tested this patch with one of our customers, and it has been 
> running successfully for weeks (e.g. there were cases where our Spark 
> Streaming application was stopped, tens of thousands of files were created 
> in the directory in the meantime, and our Spark Streaming application had 
> to process those existing files after it was started again).
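To make the proposed grouping policy concrete, the following is a minimal, self-contained sketch of the selection logic described above. It is not the actual {{FileInputDStream}} code: the {{MicrobatchPolicy}} object, the {{pendingFiles}} queue, and the {{selectFilesForBatch}} helper are hypothetical names; only the {{spark.streaming.microbatch.size}} semantics ({{0}} = take every newly discovered file, {{N}} = take at most {{N}} per interval) come from the proposal.

{code:scala}
// Illustrative sketch only -- NOT the actual FileInputDStream internals.
// The pendingFiles queue and selectFilesForBatch helper are hypothetical;
// only the spark.streaming.microbatch.size semantics come from the proposal.
import scala.collection.mutable

object MicrobatchPolicy {
  // Files discovered but not yet handed to any batch (state a patched
  // FileInputDStream would presumably keep across batch intervals).
  private val pendingFiles = mutable.Queue[String]()

  /** Pick the files for the current batch interval.
    *
    * @param newlyDiscovered files found in this interval's directory scan
    * @param microbatchSize  proposed spark.streaming.microbatch.size value;
    *                        0 keeps the default behavior (take everything)
    */
  def selectFilesForBatch(newlyDiscovered: Seq[String],
                          microbatchSize: Int): Seq[String] = {
    pendingFiles ++= newlyDiscovered
    val n =
      if (microbatchSize <= 0) pendingFiles.size  // default: whole backlog
      else math.min(microbatchSize, pendingFiles.size)
    // Dequeue n files in discovery order; the rest stay pending for the
    // next batch interval.
    Seq.fill(n)(pendingFiles.dequeue())
  }
}
{code}

With a microbatch size of 100, a backlog of 10,000 pre-existing files would drain over roughly 100 consecutive batch intervals (assuming no new files arrive) instead of flooding the first one.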



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches

2015-08-03 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-7441:
-
Target Version/s: 1.6.0  (was: 1.5.0)




[jira] [Updated] (SPARK-7441) Implement microbatch functionality so that Spark Streaming can process a large backlog of existing files discovered in batch in smaller batches

2015-05-07 Thread Emre Sevinç (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emre Sevinç updated SPARK-7441:
---
Description: 
Implement microbatch functionality so that Spark Streaming can process a huge 
backlog of existing files, discovered in batch, in smaller batches.

Spark Streaming can process files that already exist in a directory, and 
depending on the value of {{spark.streaming.minRememberDuration}} (60 seconds 
by default, see SPARK-3276 for more details), this can mean that a Spark 
Streaming application receives thousands, or hundreds of thousands, of files 
within the first batch interval. This, in turn, leads to a 'flooding' effect: 
the streaming application has to deal with a huge number of existing files in 
a single batch interval.

We propose a simple change to 
{{org.apache.spark.streaming.dstream.FileInputDStream}} so that, based on a 
configuration property such as {{spark.streaming.microbatch.size}}, it either 
keeps its default behavior when {{spark.streaming.microbatch.size}} has its 
default value of {{0}} (meaning it takes all files discovered as new in the 
current batch interval), or processes new files in groups of 
{{spark.streaming.microbatch.size}} (e.g. in groups of 100).

We have tested this patch with one of our customers, and it has been running 
successfully for weeks (e.g. there were cases where our Spark Streaming 
application was stopped, tens of thousands of files were created in the 
directory in the meantime, and our Spark Streaming application had to process 
those existing files after it was started again).

  was:
Implement microbatch functionality so that Spark Streaming can process a huge 
backlog of existing files, discovered in batch, in smaller batches.

Spark Streaming can process files that already exist in a directory, and 
depending on the value of {{spark.streaming.minRememberDuration}} (60 seconds 
by default, see SPARK-3276 for more details), this can mean that a Spark 
Streaming application receives thousands, or hundreds of thousands, of files 
within the first batch interval. This, in turn, leads to a 'flooding' effect: 
the streaming application has to deal with a huge number of existing files in 
a single batch interval.

We propose a simple change to 
{{org.apache.spark.streaming.dstream.FileInputDStream}} so that, based on a 
configuration property such as {{spark.streaming.microbatch.size}}, it either 
keeps its default behavior when {{spark.streaming.microbatch.size}} has its 
default value of {{0}} (infinite), or processes new files in groups of 
{{spark.streaming.microbatch.size}} (e.g. in groups of 100).

We have tested this patch with one of our customers, and it has been running 
successfully for weeks (e.g. there were cases where our Spark Streaming 
application was stopped, tens of thousands of files were created in the 
directory in the meantime, and our Spark Streaming application had to process 
those existing files after it was started again).
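For context, here is a minimal usage sketch, assuming the proposed patch were applied. The {{spark.streaming.microbatch.size}} property is the one proposed above and does not exist in stock Spark; the application name, input path, and {{BacklogDrainer}} object are made-up examples, while everything else is the standard Spark Streaming API.

{code:scala}
// Hypothetical usage, assuming the proposed patch is applied. The
// spark.streaming.microbatch.size property is the one proposed in this
// ticket; the rest is standard Spark Streaming API.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BacklogDrainer {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("BacklogDrainer")
      // Proposed property: cap each batch at 100 backlog files (0 = default).
      .set("spark.streaming.microbatch.size", "100")

    val ssc = new StreamingContext(conf, Seconds(60))

    // With the patch, a directory holding tens of thousands of pre-existing
    // files would be drained 100 files per 60-second batch instead of all
    // at once in the first interval.
    val lines = ssc.textFileStream("hdfs:///data/incoming")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}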

