[ https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tathagata Das updated SPARK-7441:
---------------------------------
    Target Version/s: 1.6.0  (was: 1.5.0)

> Implement microbatch functionality so that Spark Streaming can process a 
> large backlog of existing files discovered in batch in smaller batches
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-7441
>                 URL: https://issues.apache.org/jira/browse/SPARK-7441
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Emre Sevinç
>              Labels: performance
>
> Implement microbatch functionality so that Spark Streaming can process a huge 
> backlog of existing files, discovered all at once, in smaller batches.
> Spark Streaming can process files that already exist in a directory, and 
> depending on the value of "{{spark.streaming.minRememberDuration}}" (60 
> seconds by default, see SPARK-3276 for details), a Spark Streaming 
> application can receive thousands, or even hundreds of thousands, of files 
> within the first batch interval. This, in turn, creates a 'flooding' effect: 
> the streaming application has to deal with a huge number of existing files 
> in a single batch interval.
> We propose a very simple change to 
> {{org.apache.spark.streaming.dstream.FileInputDStream}}, driven by a 
> configuration property such as "{{spark.streaming.microbatch.size}}": when 
> the property has its default value of {{0}}, the current behavior is kept 
> (all files discovered as new in the current batch interval are processed at 
> once), otherwise new files are processed in groups of 
> {{spark.streaming.microbatch.size}} (e.g. in groups of 100), as sketched 
> below.
> We have tested this patch at one of our customers, and it has been running 
> successfully for weeks (including cases where our Spark Streaming 
> application was stopped, tens of thousands of files were created in the 
> directory in the meantime, and the application had to process those 
> existing files once it was started again).
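> A minimal sketch of the grouping behavior described above, assuming the 
> proposed "{{spark.streaming.microbatch.size}}" property with a default of 
> {{0}}; the object and method names are illustrative only and do not reflect 
> the actual {{FileInputDStream}} patch:
> {code:scala}
> object MicrobatchGrouping {
>
>   // Splits the newly discovered files into the group to process in the
>   // current batch interval and the remainder to defer to later intervals.
>   // A microbatchSize of 0 (the proposed default) keeps today's behavior
>   // and processes everything that was discovered at once.
>   def selectFilesForBatch(
>       newFiles: Seq[String],
>       microbatchSize: Int): (Seq[String], Seq[String]) = {
>     if (microbatchSize <= 0) (newFiles, Seq.empty)
>     else newFiles.splitAt(microbatchSize)
>   }
> }
>
> object MicrobatchGroupingExample {
>   def main(args: Array[String]): Unit = {
>     // With spark.streaming.microbatch.size = 100 and a backlog of 250
>     // discovered files, the first interval processes 100 and defers 150.
>     val backlog = (1 to 250).map(i => s"hdfs:///data/file-$i")
>     val (processNow, deferred) =
>       MicrobatchGrouping.selectFilesForBatch(backlog, 100)
>     println(s"processing ${processNow.size} now, deferring ${deferred.size}")
>   }
> }
> {code}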



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
