[ https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tathagata Das updated SPARK-7441:
---------------------------------
    Target Version/s: 1.6.0  (was: 1.5.0)

> Implement microbatch functionality so that Spark Streaming can process a
> large backlog of existing files discovered in batch in smaller batches
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-7441
>                 URL: https://issues.apache.org/jira/browse/SPARK-7441
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Emre Sevinç
>              Labels: performance
>
> Implement microbatch functionality so that Spark Streaming can process a huge
> backlog of existing files discovered in batch in smaller batches.
>
> Spark Streaming can process already existing files in a directory, and
> depending on the value of "{{spark.streaming.minRememberDuration}}" (60
> seconds by default, see SPARK-3276 for more details), this might mean that a
> Spark Streaming application receives thousands, or hundreds of thousands,
> of files within the first batch interval. This, in turn, has a 'flooding'
> effect on the streaming application, which then tries to deal with a huge
> number of existing files in a single batch interval.
>
> We will propose a very simple change to
> {{org.apache.spark.streaming.dstream.FileInputDStream}}, so that, based on a
> configuration property such as "{{spark.streaming.microbatch.size}}", it will
> either keep its default behavior when {{spark.streaming.microbatch.size}}
> has the default value of {{0}} (meaning as many new files as have been
> discovered in the current batch interval), or will process new files in
> groups of {{spark.streaming.microbatch.size}} (e.g. in groups of 100s).
>
> We have tested this patch at one of our customers, and it has been running
> successfully for weeks (e.g. there were cases where our Spark Streaming
> application was stopped, and in the meantime tens of thousands of files were
> created in a directory, and our Spark Streaming application had to process
> those existing files after it was started).
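
The grouping behavior described in the issue can be illustrated with a small, Spark-independent sketch. This is not the patch itself and not part of Spark's API: the class name BacklogBatcher, its methods, and the idea of carrying leftover files over to later batch intervals are assumptions made for illustration only. It models a microbatchSize of 0 as "hand over everything discovered", matching the default described above, and any positive value as a cap on the number of files handed to a single batch interval.

```scala
import scala.collection.mutable

// Hypothetical helper, NOT part of Spark: models how a file-based input
// stream could cap the number of files handed to each batch interval.
final class BacklogBatcher(microbatchSize: Int) {
  require(microbatchSize >= 0, "microbatchSize must be non-negative")

  // Files that were discovered but have not yet been handed to a batch.
  private val pending = mutable.Queue.empty[String]

  /** Called once per batch interval with the files discovered in that interval.
    * Returns the files that this interval should process, oldest first.
    */
  def nextBatch(newlyDiscovered: Seq[String]): Seq[String] = {
    pending ++= newlyDiscovered
    // 0 keeps the default behavior: process everything discovered so far.
    val take =
      if (microbatchSize == 0) pending.size
      else math.min(microbatchSize, pending.size)
    List.fill(take)(pending.dequeue())
  }

  /** Number of discovered files still waiting for a later interval. */
  def backlogSize: Int = pending.size
}
```

With a cap of 2 and five files found at startup, the first call to nextBatch returns the first two files and leaves three pending; later intervals drain the backlog two files at a time, even when no new files arrive in between.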