Hi Spark gurus,

I am evaluating Spark Streaming. In my application I need to maintain cumulative statistics (e.g. a total running word count), so I need to call updateStateByKey on every micro-batch. With this setup I observe the following behavior:

* The processing time spikes every 10 seconds, usually to about 5x the normal value (which I guess is the checkpointing job).
* The processing time grows steadily over time; after 10 minutes it is much higher than the batch interval, which leads to a huge scheduling delay and a long queue of active batches.

My questions are:

* Is this expected behavior? Is there any way to improve checkpointing performance?
* How does checkpointing work in Spark Streaming? Does it need to load all previous checkpoint data in order to build the new checkpoint?

My job is very simple:

    JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

    JavaPairInputDStream<String, String> messages =
        KafkaUtils.createDirectStream(...);

    JavaPairDStream<String, List<Double>> stats = messages.mapToPair(parseJson)
        .reduceByKey(REDUCE_STATS)
        .updateStateByKey(RUNNING_STATS);

    stats.print();
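For reference, updateStateByKey requires a checkpoint directory, and the per-stream checkpoint interval can be overridden explicitly (the default for stateful streams is a multiple of the batch interval that is at least 10 seconds, which matches the 10-second spikes above; Spark's documentation suggests trying 5-10 sliding intervals). A minimal sketch against the snippet above, where the directory path and the 20-second interval are example values, not recommendations:

```java
// Mandatory for stateful transformations such as updateStateByKey;
// the path here is a placeholder, not a real cluster location.
jssc.checkpoint("hdfs://example/checkpoints");

// Override the default RDD checkpoint interval for this stream
// (20s = 10 sliding intervals at a 2-second batch interval).
stats.checkpoint(Durations.seconds(20));
```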
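To make the cost model concrete, here is a plain-Java sketch (not Spark code, and RUNNING_STATS above may differ) of what updateStateByKey does conceptually each micro-batch: the update function runs for every key that has new values or existing state, which is why per-batch cost grows with the total state size even when a batch is small.

```java
import java.util.*;

public class RunningCount {
    // Update function: previous count (if any) plus the new counts for the key.
    static Optional<Integer> updateState(List<Integer> newValues, Optional<Integer> state) {
        int sum = state.orElse(0);
        for (int v : newValues) sum += v;
        return Optional.of(sum);
    }

    // One micro-batch: the update function is applied to every key with
    // either new values or prior state, so the whole state is touched.
    static Map<String, Integer> step(Map<String, Integer> state, Map<String, List<Integer>> batch) {
        Map<String, Integer> next = new HashMap<>();
        Set<String> keys = new HashSet<>(state.keySet());
        keys.addAll(batch.keySet());
        for (String k : keys) {
            List<Integer> vals = batch.getOrDefault(k, Collections.emptyList());
            Optional<Integer> prev = state.containsKey(k) ? Optional.of(state.get(k)) : Optional.empty();
            updateState(vals, prev).ifPresent(v -> next.put(k, v));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Integer> state = new HashMap<>();
        state = step(state, Map.of("spark", List.of(1, 1), "kafka", List.of(1)));
        state = step(state, Map.of("spark", List.of(1)));
        System.out.println(state.get("spark")); // 3: cumulative across batches
        System.out.println(state.get("kafka")); // 1: carried over with no new data
    }
}
```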