Best way to avoid updateStateByKey from running without data

2015-07-10 Thread micvog
updateStateByKey will run the update function on every batch interval, even if the incoming batch is empty. Is there a way to prevent that? If the incoming DStream contains no RDDs (or only RDDs with a count of 0), then I don't want my update function to run. Note that this is different from running the update
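
A minimal sketch of the only workaround I can see so far, assuming an integer count as the state (the socket source, checkpoint directory and batch interval below are placeholders): short-circuiting inside the update function keeps the no-data path cheap for a key, but Spark still invokes the function.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Short-circuit when a key received no new values in this batch: the existing state
// is returned unchanged. updateStateByKey still calls the function, but the call is cheap.
val updateCounts = (newValues: Seq[Int], state: Option[Int]) => {
  if (newValues.isEmpty) state
  else Some(state.getOrElse(0) + newValues.sum)
}

val conf = new SparkConf().setAppName("StatefulWordCount")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("checkpoint")                          // required for stateful DStream ops

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .updateStateByKey(updateCounts)

counts.print()
ssc.start()
ssc.awaitTermination()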

Does spark guarantee that the same task will process the same key over time?

2015-07-09 Thread micvog
In the simplest word count example, I want to update the count in memory and always have the same word updated by the same task, not use any distributed memstore. I know that updateStateByKey should guarantee that, but how do you approach this problem outside of Spark
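
A sketch of how I understand the partitioning side, assuming a DStream[String] named words and an update function updateCounts: (Seq[Int], Option[Int]) => Option[Int] as in a standard stateful word count (both names are mine): passing an explicit partitioner pins each key to a fixed partition index across batches, though which executor ends up running that partition is still up to the scheduler.

import org.apache.spark.HashPartitioner

// Pin keys to partitions: the same word always hashes to the same partition index
// across batches. Executor placement for that partition is not guaranteed by this alone.
val partitioner = new HashPartitioner(8)   // 8 partitions, arbitrary example value
val counts = words
  .map(word => (word, 1))
  .updateStateByKey(updateCounts, partitioner)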

Spark Streaming broadcast to all keys

2015-07-03 Thread micvog
updateStateByKey is useful, but what if I want to apply an operation to all existing keys, not only the ones in the current RDD? In word count, for example, is there a way to decrease *all* words seen so far by 1? I was thinking of keeping a static class per node with the count information and issuing a
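
One possibility I am considering, sketched below under the assumption that the state is just an Int count: since updateStateByKey invokes the update function for every key it is already tracking on every batch (not only keys present in the current RDD), a global decay can live inside the function itself.

// Decay every tracked word by 1 each batch, on top of adding any new occurrences.
// Returning None drops a key from the state once its count reaches zero.
val decayingUpdate = (newValues: Seq[Int], state: Option[Int]) => {
  val updated = state.getOrElse(0) + newValues.sum - 1
  if (updated > 0) Some(updated) else None
}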