Hi, I am using the Spark Streaming checkpointing mechanism and reading data from Kafka. The window duration for my application is 2 hours with a sliding interval of 15 minutes.
So my batches run at the following intervals:

- 09:45
- 10:00
- 10:15
- 10:30
- and so on

When my job is restarted and recovers from the checkpoint, it performs the re-partitioning step twice for each 15-minute batch until the 2-hour window is complete; after that, re-partitioning takes place only once. For example, when the job recovers at 16:15, it re-partitions both the 16:15 Kafka stream and the 14:15 Kafka stream, and all the other intermediate stages are computed for the 16:15 batch as well. I am using reduceByKeyAndWindow with an inverse function. Once the 2-hour window is complete (18:15 onward), re-partitioning takes place only once.

It seems the checkpoint does not have RDDs stored for batches older than the 2-hour window duration. Because of this, my job takes more time than usual after recovery. Is there a way, or some configuration parameter, that would help avoid re-partitioning twice?

Attaching the snaps showing re-partitioning taking place twice after recovery from checkpoint.

Thanks!!
Kundan
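For context, here is a minimal sketch of the setup described above: a 2-hour window sliding every 15 minutes, using reduceByKeyAndWindow with an inverse function and checkpoint-based recovery via StreamingContext.getOrCreate. The checkpoint directory, the placeholder input source, and the key/value mapping are hypothetical stand-ins, not the actual job (the real job reads from Kafka).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowedJob {
  // Hypothetical checkpoint location; the real job would use its own path.
  val checkpointDir = "hdfs:///tmp/windowed-job-checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("WindowedJob")
    val ssc = new StreamingContext(conf, Minutes(15)) // batch = slide interval
    ssc.checkpoint(checkpointDir)

    // Placeholder source standing in for the Kafka direct stream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.map(line => (line, 1L))

    // With an inverse function, Spark must retain each batch's reduced
    // values (up to the window duration) so old batches can be
    // "subtracted" as the window slides.
    val windowedCounts = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, // add batches entering the window
      (a: Long, b: Long) => a - b, // subtract batches leaving the window
      Minutes(120),                // window duration: 2 hours
      Minutes(15)                  // slide interval: 15 minutes
    )
    windowedCounts.print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, recover the context (and its windowed state) from the
    // checkpoint if one exists; otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With this pattern, a recovered context must recompute any batches inside the window that were not materialized in the checkpoint, which may be why stages run twice until a full window has elapsed after recovery.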