You are correct: the default checkpointing interval is 10 seconds or your batch 
interval, whichever is larger. You can change it by calling .checkpoint(interval) 
on your resulting DStream.
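As a minimal sketch (assuming a stateful pair stream named wordCounts, which is a hypothetical variable name), overriding the interval looks like:

```java
// Checkpoint the state every 30 seconds instead of the default
// (a multiple of the batch interval that is at least 10 seconds).
wordCounts.checkpoint(Durations.seconds(30));
```

This is a configuration fragment, not a standalone program; it assumes the usual Spark Streaming imports are in place.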

As for the rest: you are probably keeping an "all time" word count that grows 
unbounded if you never remove words from the map. Keep in mind that 
updateStateByKey is called for every key in the state RDD, regardless of whether 
it has new occurrences or not.

You should consider at least one of these strategies:

  *   Run your word count on a windowed DStream (e.g. unique counts over the 
last 15 minutes)
     *   Your best bet here is reduceByKeyAndWindow with an inverse function
  *   Make your state object richer and prune out words with very few 
occurrences or that haven't been updated for a long time
     *   You can do this by returning None from updateStateByKey
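To illustrate the first strategy: reduceByKeyAndWindow with an inverse function keeps window counts incrementally, applying the reduce function to the batch entering the window and the inverse function to the batch leaving it. A plain-Java sketch of that arithmetic (class and method names here are illustrative, not Spark API):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WindowedSum {
    private final int windowBatches;
    private final Deque<Long> batches = new ArrayDeque<>();
    private long total = 0;

    public WindowedSum(int windowBatches) { this.windowBatches = windowBatches; }

    // One slide: reduce in the newest batch, inverse-reduce out the oldest.
    public long slide(long newBatchCount) {
        batches.addLast(newBatchCount);
        total += newBatchCount;              // reduceFunc: (a, b) -> a + b
        if (batches.size() > windowBatches) {
            total -= batches.removeFirst();  // invFunc:    (a, b) -> a - b
        }
        return total;
    }

    public static void main(String[] args) {
        // Window of 3 batches, e.g. a 15-minute window over 5-minute batches.
        WindowedSum w = new WindowedSum(3);
        System.out.println(w.slide(4)); // 4
        System.out.println(w.slide(2)); // 6
        System.out.println(w.slide(5)); // 11
        System.out.println(w.slide(1)); // 8 (the first batch of 4 left the window)
    }
}
```

The point is that each slide costs O(1) per key instead of re-reducing the whole window, which is why the inverse-function variant scales better for long windows.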
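For the second strategy, the key point is that the update function you pass to updateStateByKey can drop a key entirely by returning None (in the Java API, an absent Optional). A plain-Java sketch of such an update function, with a hypothetical WordState that tracks when the word was last seen (java.util.Optional stands in for Spark's own Optional type):

```java
import java.util.List;
import java.util.Optional;

public class PruneUpdate {
    // Hypothetical state object: running count plus the batch it was last updated in.
    public static class WordState {
        public final long count;
        public final long lastUpdatedBatch;
        public WordState(long count, long lastUpdatedBatch) {
            this.count = count;
            this.lastUpdatedBatch = lastUpdatedBatch;
        }
    }

    // Shaped like an updateStateByKey update function:
    // returning an empty Optional removes the key from the state RDD.
    public static Optional<WordState> update(List<Long> newCounts,
                                             Optional<WordState> old,
                                             long currentBatch,
                                             long maxIdleBatches) {
        long added = newCounts.stream().mapToLong(Long::longValue).sum();
        if (added > 0) {
            long base = old.map(s -> s.count).orElse(0L);
            return Optional.of(new WordState(base + added, currentBatch));
        }
        // No new occurrences: prune the word once it has been idle long enough.
        if (old.isPresent()
                && currentBatch - old.get().lastUpdatedBatch >= maxIdleBatches) {
            return Optional.empty();
        }
        return old;
    }

    public static void main(String[] args) {
        Optional<WordState> s = update(List.of(3L, 2L), Optional.empty(), 10, 5);
        System.out.println(s.get().count);                           // 5
        // Six idle batches later with no new data: the key is pruned.
        System.out.println(update(List.of(), s, 16, 5).isPresent()); // false
    }
}
```

Because updateStateByKey visits every key on every batch, this idle check runs even for words with no new occurrences, which is exactly where the pruning happens.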

Hope this helps,
-adrian

From: Thúy Hằng Lê
Date: Monday, November 2, 2015 at 7:20 AM
To: user@spark.apache.org
Subject: Spark Streaming data checkpoint performance

JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, 
Durations.seconds(2));
