Hi all,

It is currently difficult to tell from the Spark docs, or from the
materials I have come across online, how the updateStateByKey and
mapWithState operators in Spark Streaming scale with the size of the
state, and how to reason about sizing the cluster appropriately.

According to this article:
https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-apache-spark-streaming.html
mapWithState can handle a lot more state than updateStateByKey, but the
discussion there is in terms of number of keys, without details about
cluster sizes. What about the size of the state in GB?
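To make the question concrete, this is roughly the kind of stateful
pipeline I have in mind, keeping a running total per key with each of the
two operators (just a minimal sketch; the key and value types are made up):

import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Input: a DStream of (key, count) pairs; the goal is a running total per key.
def runningTotals(events: DStream[(String, Long)]): Unit = {

  // updateStateByKey: the update function touches every key in the state on
  // every batch, so per-batch cost grows with the total size of the state.
  val totalsOld: DStream[(String, Long)] =
    events.updateStateByKey[Long] { (newValues: Seq[Long], current: Option[Long]) =>
      Some(current.getOrElse(0L) + newValues.sum)
    }

  // mapWithState: the mapping function is only invoked for keys present in
  // the current batch. No timeout is set, so the state never expires.
  val spec = StateSpec.function {
    (key: String, value: Option[Long], state: State[Long]) =>
      val total = state.getOption().getOrElse(0L) + value.getOrElse(0L)
      state.update(total)
      (key, total)
  }
  val totalsNew: DStream[(String, Long)] = events.mapWithState(spec)
}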

I have ~100 GB worth of state (not all of it updating all the time,
obviously, and none of it expires). Will Spark be able to handle that with
these operators?

How big does the cluster have to be to handle this reliably and offer
uninterrupted service (number of nodes, memory per node, etc.)?

How can you deal with bootstrapping the initial state?
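For example, my current idea would be to load the existing state as an
RDD and hand it to StateSpec.initialState, if that is the right approach
(a sketch; where the initial RDD comes from is hypothetical):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{State, StateSpec}

// Build a StateSpec that starts from previously loaded state rather than empty.
// `initial` would come from wherever the state lives today (files, a DB dump, ...).
def specWithBootstrap(initial: RDD[(String, Long)])
    : StateSpec[String, Long, Long, (String, Long)] =
  StateSpec.function {
    (key: String, value: Option[Long], state: State[Long]) =>
      val total = state.getOption().getOrElse(0L) + value.getOrElse(0L)
      state.update(total)
      (key, total)
  }.initialState(initial)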

What about code upgrades?
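My understanding is that the streaming checkpoint cannot be restored once
the application code changes, so I was thinking of periodically writing out
the full state with stateSnapshots() and re-bootstrapping the upgraded job
from the latest dump via initialState, roughly like this (again just a
sketch, with a made-up output path):

import org.apache.spark.streaming.dstream.{DStream, MapWithStateDStream}

// Periodically dump the full key -> state map so that after an upgrade the
// new job can be re-bootstrapped from the latest dump via initialState.
def snapshotState(
    stateStream: MapWithStateDStream[String, Long, Long, (String, Long)],
    outputDir: String): Unit = {
  val snapshots: DStream[(String, Long)] = stateStream.stateSnapshots()
  snapshots.foreachRDD { rdd =>
    // One directory of object files per batch; outputDir is made up.
    rdd.saveAsObjectFile(s"$outputDir/state-${System.currentTimeMillis()}")
  }
}

Is that a reasonable way to handle an upgrade, or is there a better pattern?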

Ideally I would like to keep my state in Spark so as not to have to manage
an external data store for it. What is not clear to me is at what state
size I would have to move from keeping the state in Spark to keeping it in
an external data store.

Thanks
