Hi Adrian,

Thanks for the information.

However, neither of your two suggestions really works for me.

Accuracy is the most important aspect of my application, so keeping only a
15-minute window of stats or pruning out some of the keys is not an option
for me.
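
(Just to confirm I understood the windowed alternative correctly, this is
roughly what I think you meant; the stream name and the intervals are only
placeholders, and I believe this form requires checkpointing to be enabled:)

    // Sketch of the windowed alternative as I understand it: counts over the
    // last 15 minutes, with an inverse function so values leaving the window
    // are subtracted instead of the whole window being recomputed.
    JavaPairDStream<String, Integer> windowedCounts = wordPairs.reduceByKeyAndWindow(
        (a, b) -> a + b,           // add counts entering the window
        (a, b) -> a - b,           // subtract counts leaving the window (inverse function)
        Durations.minutes(15),     // window length
        Durations.seconds(2));     // slide interval (my batch interval)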

I can change the checkpoint interval as you suggest, but is there any other
Spark configuration for tuning data checkpoint performance?
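
Concretely, I plan to do something like the following based on your
suggestion (just a sketch; the stream names, the update function and the
20-second interval are placeholders):

    // Stateful stream produced by updateStateByKey (updateFunction is a placeholder).
    JavaPairDStream<String, Long> stateCounts = pairs.updateStateByKey(updateFunction);
    // Checkpoint this stateful stream every 20s (10 batches of 2s) instead of
    // the default of max(batch interval, 10 seconds).
    stateCounts.checkpoint(Durations.seconds(20));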

And just out of curiosity, technically why does updateStateByKey need to be
called for every key, regardless of whether there are new occurrences or not?
Does Spark have any plan to fix this?
I have 4M keys whose statistics need to be maintained, but only a few of them
change in each batch interval.
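
To illustrate what I mean, my update function is roughly of this shape
(a simplified sketch with placeholder names; Optional here is Guava's
com.google.common.base.Optional, as used by the Spark 1.x Java API):

    // Sketch only: this function runs once per key in the state RDD every batch,
    // even when newValues is empty because the key had no new occurrences.
    JavaPairDStream<String, Long> stats = pairs.updateStateByKey(
        (List<Long> newValues, Optional<Long> current) -> {
          long total = current.or(0L);
          for (long v : newValues) {
            total += v;
          }
          return Optional.of(total);   // evaluated for all ~4M keys each batch
        });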

2015-11-02 22:37 GMT+07:00 Adrian Tanase <atan...@adobe.com>:

> You are correct, the default checkpointing interval is 10 seconds or your
> batch size, whichever is bigger. You can change it by calling
> .checkpoint(x) on your resulting Dstream.
>
> For the rest, you are probably keeping an “all time” word count that grows
> unbounded if you never remove words from the map. Keep in mind that
> updateStateByKey is called for every key in the state RDD, regardless if
> you have new occurrences or not.
>
> You should consider at least one of these strategies:
>
>    - run your word count on a windowed Dstream (e.g. Unique counts over
>    the last 15 minutes)
>       - Your best bet here is reduceByKeyAndWindow with an inverse
>       function
>    - Make your state object more complicated and try to prune out words
>    with very few occurrences or that haven’t been updated for a long time
>       - You can do this by emitting None from updateStateByKey
>
> Hope this helps,
> -adrian
>
> From: Thúy Hằng Lê
> Date: Monday, November 2, 2015 at 7:20 AM
> To: "user@spark.apache.org"
> Subject: Spark Streaming data checkpoint performance
>
> JavaStreamingContext jssc = new JavaStreamingContext(sparkConf,
> Durations.seconds(2));
>
