Hello all,
We have read in multiple sources (
<https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html>,
<https://flink.apache.org/usecases.html>) that Flink has been used
for use cases with terabytes of application state.

We are considering using Flink for a similar use case with *keyed state in
the range of 20 to 30 TB*. We have a few questions about this.


   - *Is Flink a good option for data at this scale?* We are
   considering using RocksDB as the state backend.
   - *What happens when we want to add a node to the cluster?*
      - As per our understanding, if we have 10 nodes in our cluster with
      20 TB of state, adding a node would require the entire 20 TB of data
      to be shipped again from the external checkpoint remote storage to
      the TaskManager nodes.
      - Assuming 1 GB/s of network throughput per node, and assuming all
      nodes can read their respective 2 TB of state in parallel, this would
      mean a *minimum downtime of roughly half an hour*. And this assumes
      the throughput of the remote storage does not become the bottleneck.
      - Is there any way to reduce this estimated downtime?
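
For reference, here is the back-of-the-envelope arithmetic behind the
half-hour figure as a small sketch. All inputs are the assumptions stated
above (20 TB total, 10 nodes, parallel reads, ~1 GB/s effective read
throughput per node), not measured numbers:

```python
# Rough estimate of state-restore time when every node must re-read
# its share of state from remote checkpoint storage.
total_state_tb = 20          # assumed total keyed state
nodes = 10                   # assumed cluster size before rescaling
throughput_gb_per_s = 1.0    # assumed effective read throughput per node

state_per_node_gb = total_state_tb * 1000 / nodes        # 2000 GB per node
restore_seconds = state_per_node_gb / throughput_gb_per_s
restore_minutes = restore_seconds / 60

print(f"~{restore_minutes:.0f} minutes to restore")      # ~33 minutes
```

Note this is a lower bound: it ignores checkpoint-storage throughput
limits, RocksDB ingestion cost on the TaskManagers, and the fact that
rescaling redistributes key groups rather than copying partitions verbatim.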


Thank you!
