Dear all,

Can someone please explain to me how Spark Streaming executes the window() operation? From the Spark 1.6.1 documentation, it seems that windowed batches are automatically persisted in memory, but looking at the web UI it appears that operations already executed in previous batches are being executed again. For convenience, I attach a screenshot of my running application below.
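To make the question concrete, here is a minimal simulation of what window() semantics look like when the parent DStream's per-batch results *are* cached (plain Python, no Spark; the function names and data are hypothetical, not Spark's API). Each "batch" is a list, a windowed batch is the concatenation of the last `window_len` batches, and a cache stands in for persisted parent RDDs:

```python
# Minimal simulation of DStream.window() semantics (plain Python, no Spark).
# Assumption: each "batch" is a small list; a windowed batch is the
# concatenation of the last `window_len` batches. The cache plays the role
# of the persisted parent RDDs.

def run(steps, window_len):
    """Return how many times a batch was (re)computed from scratch."""
    compute_count = 0
    cache = {}

    def compute_batch(t):
        # Stands in for the chain of transformations producing one batch.
        nonlocal compute_count
        compute_count += 1
        return [t, t * 10]

    def batch(t):
        # With caching, each batch is computed once, then reused.
        if t not in cache:
            cache[t] = compute_batch(t)
        return cache[t]

    def windowed_batch(t):
        # Union of the last `window_len` batches (clipped at t = 0).
        return [x for u in range(max(0, t - window_len + 1), t + 1)
                for x in batch(u)]

    for t in range(steps):
        windowed_batch(t)
    return compute_count

if __name__ == "__main__":
    # 5 slide intervals, window spanning 3 batches:
    print(run(steps=5, window_len=3))  # 5 -> each batch computed exactly once
```

If the UI really shows earlier transformations re-running, the observed behaviour would correspond to dropping the cache above, in which case `compute_count` grows to roughly `steps * window_len` instead of `steps`.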
Looking at the web UI, the flatMapValues() RDDs appear to be cached (the green dot; flatMapValues() is the last operation executed before I call window() on the DStream), but at the same time all the transformations leading up to flatMapValues() in previous batches appear to be executed again. If that is the case, the window() operation may incur a huge performance penalty, especially with a window duration of 1 or 2 hours (as I expect for my application). Do you think checkpointing the DStream at that point would help? Note that the expected slide interval is about 5 minutes.

Hope someone can clarify this point.

Thanks,
Marco

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n27041/window.png>

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Is-window-caching-DStreams-tp27041.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
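As a back-of-the-envelope note on the cost question: with the figures from the message (2-hour window, 5-minute slide; the batch interval is my assumption, taken equal to the slide), one can estimate how many batches each windowed RDD unions and how often a batch could be recomputed without caching:

```python
# Back-of-the-envelope window() cost for the figures in the question:
# a 2-hour (120-minute) window and a 5-minute slide. The batch interval
# is an assumption (taken equal to the slide interval).

def batches_per_window(window_min, batch_min):
    """Parent batch RDDs unioned into each windowed RDD."""
    return window_min // batch_min

def windows_per_batch(window_min, slide_min):
    """Overlapping windows each batch belongs to -- the worst-case number
    of times that batch's lineage is re-executed if it is not persisted."""
    return window_min // slide_min

if __name__ == "__main__":
    print(batches_per_window(120, 5))  # 24 batches per windowed RDD
    print(windows_per_batch(120, 5))   # 24 windows touching each batch
```

So each batch would sit inside 24 successive windows, which is why re-running the pre-window transformations every slide, rather than reusing cached results, would be so expensive at this window size.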