Hi all,

It is currently difficult to tell, from the Spark docs or the materials I have come across online, how the updateStateByKey and mapWithState operators in Spark Streaming scale with the size of the state, and how to reason about sizing the cluster appropriately.
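To make it concrete, the kind of pipeline I have in mind is roughly the word-count-style sketch below, based on the mapWithState example in the docs (my real keys, values and input source are different, and the checkpoint path is just a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

object StatefulSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulSketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    // mapWithState requires checkpointing; the path here is a placeholder
    ssc.checkpoint("hdfs:///tmp/state-sketch-checkpoints")

    // Placeholder source; my real input comes from elsewhere
    val events = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))

    // Per-key running state kept entirely in Spark; no timeout is set,
    // so nothing ever expires
    val mappingFunc = (key: String, value: Option[Int], state: State[Long]) => {
      val newSum = state.getOption.getOrElse(0L) + value.getOrElse(0)
      state.update(newSum)
      (key, newSum)
    }

    val stateStream = events.mapWithState(StateSpec.function(mappingFunc))
    stateStream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The shape of the computation is the same; the difference is that my accumulated state adds up to roughly 100 GB and is never removed.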
According to this article: https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html, mapWithState can handle a lot more state than updateStateByKey, but the discussion there is in terms of the number of keys, with no details about cluster sizes. What about size in GB? I have roughly 100 GB worth of state (not all of it updating all the time, obviously, and none of it ever expires). Will Spark be able to handle that with these operators?

How big does the cluster have to be to handle this reliably and offer uninterrupted service (number of nodes, memory per node, etc.)? How can you deal with bootstrapping the state (see the sketch further below for roughly what I mean)? What about code upgrades?

Ideally I would like to keep my state in Spark so as not to have to manage an external data store for it. What is not clear to me is at what state size I have to move from keeping the state in Spark to keeping it in an external data store.
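For bootstrapping in particular, the approach I was considering is to load the existing state once and hand it to mapWithState via StateSpec.initialState, roughly as below (this reuses mappingFunc and events from the sketch above; the saved-state path and format are placeholders). I am not sure whether this is a reasonable way to do it at ~100 GB.

// Reuses mappingFunc and events from the sketch above.
// Hypothetical: the existing state has been saved as an RDD of (key, state)
// pairs; the path and format are placeholders for wherever it currently lives.
val initialStateRDD =
  ssc.sparkContext.objectFile[(String, Long)]("hdfs:///path/to/saved-state")

val spec = StateSpec.function(mappingFunc).initialState(initialStateRDD)
val bootstrappedStream = events.mapWithState(spec)

Thanks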