Hello all,

We have read in multiple sources (e.g. <https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html>, <https://flink.apache.org/usecases.html>) that Flink has been used for use cases with terabytes of application state.
We are considering using Flink for a similar use case with *keyed state in the range of 20 to 30 TB*, and we had a few questions:

- *Is Flink a good option at this scale of data?* We are considering RocksDB as the state backend.
- *What happens when we want to add a node to the cluster?*
  - As per our understanding, if we have 10 nodes in the cluster holding 20 TB of state, adding a node would require the entire 20 TB to be shipped again from the external checkpoint in remote storage to the taskmanager nodes.
  - Assuming a 1 GB/s network, and assuming all nodes can read their respective ~2 TB of state in parallel, this would mean a *minimum downtime of roughly half an hour*. And this assumes the throughput of the remote storage does not become the bottleneck.
  - Is there any way to reduce this estimated downtime?

Thank you!
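P.S. For anyone checking our arithmetic, the downtime estimate above can be reproduced with a quick sketch. The figures are our own assumptions (20 TB total state, 10 nodes restoring their shares in parallel, ~1 GB/s effective read throughput per node), not Flink-specific numbers:

```python
# Back-of-envelope estimate of recovery time after rescaling:
# each node pulls its share of the checkpointed state from
# remote storage, all nodes reading in parallel.

def restore_time_seconds(total_state_bytes: float,
                         num_nodes: int,
                         per_node_throughput_bps: float) -> float:
    """Seconds for each node to read its share of state in parallel."""
    per_node_bytes = total_state_bytes / num_nodes
    return per_node_bytes / per_node_throughput_bps

TB = 10 ** 12  # bytes
GB = 10 ** 9   # bytes

# 20 TB across 10 nodes at ~1 GB/s per node -> 2 TB / (1 GB/s) per node
seconds = restore_time_seconds(20 * TB, 10, 1 * GB)
print(f"~{seconds / 60:.0f} minutes")  # ~33 minutes
```

Note this is a best case: it ignores remote-storage throughput limits, RocksDB rebuild time on the taskmanagers, and the state redistribution that rescaling entails.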