So, one portion of our Spark Streaming application requires state. The application ingests a stream of application events (e.g. user_session_started, user_session_ended, etc.), computes metrics from them, and writes those metrics to a serving layer (see: Lambda Architecture). Two related events can arrive in the streaming context moments apart, or in indeterminate order. Given that, and the fact that our normal windows emit data every 500-10000 ms with a step of 500 ms, two related pieces of data may land in different windows. To work around this, we use updateStateByKey to persist state, rather than persisting intermediate per-key state in some external system, since building a system that handles the complexities of concurrent, idempotent updates while also scaling is non-trivial.
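To make the setup concrete, here is a minimal sketch (plain Python, no Spark dependency) of the updateStateByKey semantics we rely on: for every key that has new events or existing state, the update function receives the batch's new values plus the previous state and returns the next state. The event shapes and field names here are illustrative assumptions, not our actual schema.

```python
def update_session(new_events, old_state):
    """Fold a batch of session events into per-key state.

    new_events: list of event dicts for this key in the current batch
    old_state:  previous state for the key, or None if the key is new
    """
    state = old_state or {"started": False, "ended": False, "events": 0}
    for ev in new_events:
        state["events"] += 1
        if ev["type"] == "user_session_started":
            state["started"] = True
        elif ev["type"] == "user_session_ended":
            state["ended"] = True
    return state


def run_batch(state_by_key, batch):
    """Apply update_session the way Spark would across one micro-batch."""
    keys = set(state_by_key) | set(batch)
    next_state = {}
    for k in keys:
        new = update_session(batch.get(k, []), state_by_key.get(k))
        if new is not None:  # returning None drops the key from state
            next_state[k] = new
    return next_state


# Two related events arriving in *different* batches still meet in the state:
state = {}
state = run_batch(state, {"s1": [{"type": "user_session_started"}]})
state = run_batch(state, {"s1": [{"type": "user_session_ended"}]})
```

This mirrors the (new_values, last_state) shape of the PySpark updateStateByKey update function; the point is that the started/ended pair is reconciled in state even though the events crossed a window boundary.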
The question I have: even with a highly partitionable dataset, what are the upper scalability limits of stateful DStreams? If my dataset starts at around 10 million keys and grows at that rate monthly, what complexities am I in for? Most of the data is cold. I realize I can remove a key from the stateful DStream by sending it (key, null), but there is no easy way of knowing when the last update for a key is coming in (unless Spark has some way of saying "wait N windows, then send this tuple", or "wait until all keys in the upstream DStreams smaller than M are processed", before sending such a tuple).

Additionally, given that my data is partitionable by datetime, does it make sense to write a custom datetime partitioner and persist the DStream to disk, so that its RDDs are only pulled off disk into memory occasionally? What is the cost of keeping a number of relatively large stateful RDDs around on disk? Does Spark have to load and deserialize an entire RDD to update a single key?
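For what it's worth, one workaround I've seen for not knowing when the last update arrives is to stamp each key's state with the time of its last update, and have the update function itself return None (evicting the key) once the key has been idle past some TTL. A hedged plain-Python sketch of that idea, where IDLE_TTL and the state fields are assumptions of mine rather than anything Spark-specific:

```python
IDLE_TTL = 3600  # seconds a key may go without updates before eviction (assumed value)


def update_with_ttl(now, new_values, old_state):
    """updateStateByKey-style function that self-expires cold keys.

    now:        current batch time in seconds (would come from the batch in Spark)
    new_values: list of numeric values for this key in the current batch
    old_state:  dict with running "total" and "last_update", or None for a new key
    """
    if new_values:
        total = (old_state["total"] if old_state else 0) + sum(new_values)
        return {"total": total, "last_update": now}
    # No new data this batch: keep the key only while it is still "warm".
    if old_state and now - old_state["last_update"] < IDLE_TTL:
        return old_state
    return None  # evict the cold key from the stateful DStream
```

In real Spark code the batch time would have to be threaded in (e.g. via a closure over the batch's timestamp), but the shape shows how eviction can live inside the update function instead of requiring an explicit (key, null) tuple from upstream.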