[ https://issues.apache.org/jira/browse/SPARK-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568019#comment-14568019 ]
Tathagata Das edited comment on SPARK-1312 at 6/1/15 9:16 PM:
--------------------------------------------------------------
I have verified this in current Spark and I am closing this JIRA.

was (Author: tdas): I have verified this and I am closing this JIRA.

> Batch should read based on the batch interval provided in the StreamingContext
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-1312
>                 URL: https://issues.apache.org/jira/browse/SPARK-1312
>             Project: Spark
>          Issue Type: Bug
>          Components: Streaming
>    Affects Versions: 0.9.0
>            Reporter: Sanjay Awatramani
>            Assignee: Tathagata Das
>            Priority: Critical
>              Labels: sliding, streaming, window
>
> This problem primarily affects sliding window operations in Spark Streaming.
> Consider the following scenario:
> - A DStream is created from any source (I've checked with file and socket).
> - No actions are applied to this DStream.
> - A sliding window operation is applied to this DStream, and an action is
>   applied to the sliding window.
> In this case, Spark will not even read the input stream in a batch in which
> the sliding interval isn't a multiple of the batch interval. Put another way,
> it won't read the input when it doesn't have to apply the window function.
> This happens because all transformations in Spark are lazy.
> How to fix this or work around it (see line #3 of the snippet):
> JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(1 * 60 * 1000));
> JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
> inputStream.print(); // This is the workaround
> JavaDStream<String> objWindow = inputStream.window(
>     new Duration(windowLen * 60 * 1000), new Duration(slideInt * 60 * 1000));
> objWindow.dstream().saveAsTextFiles("/Output", "");
> The "Window operations" example in the streaming guide implies that Spark
> will read the stream in every batch, which is not happening because of the
> lazy transformations.
> Wherever a sliding window is used, in most cases no actions will be taken on
> the pre-window batch, hence my gut feeling was that Streaming would read
> every batch if any actions are being taken on the windowed stream.
> As per Tathagata,
> "Ideally every batch should read based on the batch interval provided in the
> StreamingContext."
> Refer to the original thread at
> http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Window-operations-do-not-work-as-documented-tp2999.html
> for more details, including Tathagata's conclusion.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
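[Editor's note] The description above hinges on lazy evaluation: a transformation builds a pipeline but reads nothing until an output operation (such as the `print()` workaround) forces it. The following is a minimal, hedged sketch of that general behavior using plain Python generators; it is an illustrative analogy only, not Spark code, and the `source`/`reads` names are invented for the example.

```python
# Editor's illustration of lazy evaluation (not Spark code).
# A lazy pipeline pulls data from its source only when a terminal
# "action" consumes it, mirroring why an un-acted-on batch is never read.

reads = []  # records which batches were actually read from the source


def source(batch_id):
    """Pretend to read one batch of input; log the read as a side effect."""
    reads.append(batch_id)
    return [f"record-{batch_id}"]


# A lazy "transformation": building the generator reads nothing yet.
lazy_batches = (source(b) for b in range(3))

assert reads == []  # no action applied -> no input read at all

# The "action": consuming the pipeline forces every batch to be read.
consumed = list(lazy_batches)
assert reads == [0, 1, 2]
assert consumed == [["record-0"], ["record-1"], ["record-2"]]
```

The `inputStream.print()` workaround in the report plays the role of `list(...)` here: it attaches an action to every batch, so each batch interval's input is read whether or not the window function fires in that interval.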