You are correct: the default checkpointing interval is 10 seconds or your batch interval, whichever is larger. You can change it by calling .checkpoint(interval) on your resulting DStream.
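As a minimal fragment of what that looks like (the DStream name wordCounts and the 20-second value are assumptions, not from your code; a common rule of thumb is 5-10x the batch interval):

```java
// Assumed: wordCounts is the JavaPairDStream<String, Long> produced by
// updateStateByKey. Raise the data-checkpoint interval from the default
// (max of 10s and the batch interval) to cut checkpointing overhead.
wordCounts.checkpoint(Durations.seconds(20));
```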
For the rest, you are probably keeping an "all time" word count that grows unbounded if you never remove words from the map. Keep in mind that updateStateByKey is called for every key in the state RDD, regardless of whether it has new occurrences or not. You should consider at least one of these strategies:

* Run your word count on a windowed DStream (e.g. unique counts over the last 15 minutes). Your best bet here is reduceByKeyAndWindow with an inverse reduce function.
* Make your state object more complicated and try to prune out words with very few occurrences or that haven't been updated for a long time. You can do this by emitting None from your updateStateByKey update function.

Hope this helps,
-adrian

From: Thúy Hằng Lê
Date: Monday, November 2, 2015 at 7:20 AM
To: "user@spark.apache.org"
Subject: Spark Streaming data checkpoint performance

JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));
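To make the pruning idea concrete, here is a standalone sketch of an update function that drops stale words. It uses plain java.util.Optional so it can run outside Spark (Spark's own Java API uses its own Optional type); the WordState class, PruneUpdate name, and the 100-batch idle threshold are all illustrative assumptions, not part of your code:

```java
import java.util.List;
import java.util.Optional;

// State kept per word: total count plus how many batches have passed
// since the word was last seen.
class WordState {
    final long count;
    final int idleBatches;

    WordState(long count, int idleBatches) {
        this.count = count;
        this.idleBatches = idleBatches;
    }
}

class PruneUpdate {
    static final int MAX_IDLE_BATCHES = 100; // assumed staleness threshold

    // Mirrors updateStateByKey semantics: called once per key per batch,
    // with that batch's new values and the previous state. Returning
    // Optional.empty() removes the key from the state RDD.
    static Optional<WordState> update(List<Long> newValues,
                                      Optional<WordState> state) {
        long oldCount = state.map(s -> s.count).orElse(0L);
        if (newValues.isEmpty()) {
            // No new occurrences this batch: age the entry, prune if stale.
            int idle = state.map(s -> s.idleBatches).orElse(0) + 1;
            if (idle > MAX_IDLE_BATCHES) {
                return Optional.empty();
            }
            return Optional.of(new WordState(oldCount, idle));
        }
        long sum = newValues.stream().mapToLong(Long::longValue).sum();
        return Optional.of(new WordState(oldCount + sum, 0));
    }
}
```

Because stale keys are actively removed, the state RDD (and therefore each data checkpoint) stays bounded instead of growing with every word ever seen.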