updateStateByKey runs the update function on every batch interval, even if the
incoming batch is empty. Is there a way to prevent that? If the incoming
DStream contains no RDDs (or RDDs of count 0) then I don't want my update
function to run.
Note that this is different from running the update
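One possible workaround, sketched below in Python: the update function can return the previous state untouched when a key has no new values. This assumes the `(new_values, previous_state)` argument shape that PySpark's `updateStateByKey` uses. The name `update_count` is made up for illustration. This does not stop Spark from invoking the function on an empty batch; it only makes that call a cheap no-op.

```python
from typing import List, Optional


def update_count(new_values: List[int], running_count: Optional[int]) -> Optional[int]:
    """Word-count state update that does nothing when a key has no new data.

    new_values    -- the values seen for this key in the current batch
    running_count -- the previous state, or None for a key seen for the first time
    """
    if not new_values:
        # Empty batch for this key: hand the old state back unchanged.
        return running_count
    # Otherwise add this batch's occurrences to the running total.
    return (running_count or 0) + sum(new_values)
```

With a hypothetical DStream of `(word, 1)` pairs called `pairs`, this would be wired in as `pairs.updateStateByKey(update_count)`, with checkpointing enabled on the streaming context.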
For example, in the simplest word count case, I want to update the count in
memory and always have the same word updated by the same task, without
using any distributed memstore.
I know that updateStateByKey should guarantee that, but how do you approach
this problem outside of Spark?
updateStateByKey is useful, but what if I want to perform an operation on all
existing keys (not only the ones in this RDD)?
Word count, for example: is there a way to decrease the counts of *all* words
seen so far by 1?
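As far as the Spark Streaming documentation describes it, the state update function is applied to every existing key on each batch, whether or not that key has new data, so the decrement can live inside the update function itself. Below is a minimal sketch in Python under that assumption. The names `decay_and_count` and `simulate_batch` are hypothetical. `simulate_batch` is only a local stand-in for one `updateStateByKey` step, not a Spark API. Dropping a word once its count reaches zero is also my assumption, not a requirement.

```python
from typing import Dict, List, Optional


def decay_and_count(new_values: List[int], running_count: Optional[int]) -> Optional[int]:
    """Decrease every previously seen key by 1, then add this batch's occurrences."""
    # A key seen for the first time has no previous count to decay.
    decayed = 0 if running_count is None else running_count - 1
    total = decayed + sum(new_values)
    # In updateStateByKey, returning None removes the key from the state.
    return total if total > 0 else None


def simulate_batch(state: Dict[str, int], batch: List[str]) -> Dict[str, int]:
    """Local stand-in for one updateStateByKey step (hypothetical helper)."""
    # Group the batch into per-key lists of 1s, like (word, 1) pairs.
    grouped: Dict[str, List[int]] = {}
    for word in batch:
        grouped.setdefault(word, []).append(1)

    next_state: Dict[str, int] = {}
    # The update runs for keys with new data AND for keys that only exist in state.
    for key in set(state) | set(grouped):
        updated = decay_and_count(grouped.get(key, []), state.get(key))
        if updated is not None:
            next_state[key] = updated
    return next_state


state = simulate_batch({}, ["a", "a", "b"])
state = simulate_batch(state, ["a"])
```

In a real job, only `decay_and_count` would be passed to `updateStateByKey`; the simulation just shows that keys absent from the current batch still get decremented.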
I was thinking of keeping a static class per node with the count information
and issuing a