Some responses inline below:

On Sat, Nov 19, 2016 at 4:07 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:

Hi Steven,

As Robert said, some of our jobs have state sizes of around a TB or
more. We use the RocksDB state backend with some configs tuned to
perform well on SSDs (you can get some tips here:
https://www.youtube.com/watch?v=pvUqbIeoPzM).

We checkpoint our state to Ceph (similar to HDFS, but this is what we
have :)), and it takes 15-25 minutes for the larger jobs to perform the
checkpoints/restores. As this runs async in the background it doesn't
hurt our runtime performance; the only problem is the strain on the
network, especially when many jobs are restored at the same time.

Incremental checkpoints would definitely be crazy useful in our case,
as only a very small percentage of our state is updated between
snapshots, but it is still feasible as it is for now.

Let me know if I can help with any details.

Cheers, Gyula

Thanks, Gyula. I do have some additional questions, if you are able to
answer them:

1. How often do you checkpoint/savepoint your jobs? Do you "pipeline"
checkpoints and allow multiple transfers to happen at the same time?

2. Do you have any additional fault tolerance layers in your jobs? It
seems like if some hardware fault or software bug does manage to fail
the job, it's at least 15-25 minutes (plus catchup time) until the job
is available again.

3. In slide 14 of your presentation, you mention an LRU cache in front
of the RocksDB state. Can you give any additional details about that?
Are there any particular deficiencies in the vanilla RocksDB state
backend that the LRU cache works around? (I've sketched below, after
these questions, roughly what I imagine such a cache to look like.)

4. (This is perhaps a more general Flink question.) If you make a change
to your jobs that requires recreating the entire 1+ TB of state from the
beginning of the input, do you do anything special to backfill that
state, or do you simply run the same streaming job from the beginning?

There is Uber's presentation on this from Flink Forward 2016:
https://youtu.be/9mjAPBNl4YM . I'm curious if you have any other techniques.
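
Coming back to the LRU cache in question 3: purely as a hypothetical
sketch of what I imagine (the class name, cache size, and counting use
case are all made up, not anything from your talk), I'd picture a
write-through cache inside the user function, so RocksDB always holds
the latest value and evictions never lose data:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch only -- not Gyula's implementation. A small in-heap,
// write-through LRU in front of RocksDB-backed keyed state, to skip the
// (de)serialization cost of repeated reads for hot keys.
// Used on a keyed stream, e.g. stream.keyBy(t -> t.f0).flatMap(new CachedCountFunction())
public class CachedCountFunction
        extends RichFlatMapFunction<Tuple2<String, Long>, Tuple2<String, Long>> {

    private static final int CACHE_SIZE = 10_000; // made-up size

    private transient ValueState<Long> countState;
    private transient LinkedHashMap<String, Long> lru;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
        // Access-ordered LinkedHashMap as a simple LRU.
        lru = new LinkedHashMap<String, Long>(CACHE_SIZE, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                return size() > CACHE_SIZE;
            }
        };
    }

    @Override
    public void flatMap(Tuple2<String, Long> in, Collector<Tuple2<String, Long>> out)
            throws Exception {
        Long current = lru.get(in.f0);
        if (current == null) {
            current = countState.value(); // fall through to RocksDB on a miss
            if (current == null) {
                current = 0L;
            }
        }
        long updated = current + in.f1;
        countState.update(updated); // write-through: RocksDB stays authoritative
        lru.put(in.f0, updated);
        out.collect(Tuple2.of(in.f0, updated));
    }
}

Is it something along those lines, or a layer deeper inside the state
backend itself?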

***

For the project for which I'm currently vetting Flink, I'm actually not
so concerned with the performance of the RocksDB state backend itself,
neither the read/update/write performance nor the checkpointed data
transfer performance. I was testing with async checkpointing, and it
does seem feasible to have checkpoints running relatively frequently.
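
For reference, this is roughly the setup I was testing with (the
checkpoint URI and all intervals are placeholders, and the SSD-tuned
options are just my guess at the kind of tuning you mean):

import org.apache.flink.contrib.streaming.state.PredefinedOptions;
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 10 minutes (placeholder interval).
        env.enableCheckpointing(10 * 60 * 1000L);

        CheckpointConfig cc = env.getCheckpointConfig();
        // Give long-running snapshots of large state time to complete.
        cc.setCheckpointTimeout(30 * 60 * 1000L);
        // Let a new checkpoint start while the previous upload is still in
        // flight; keep it at 1 (or set a min pause) to serialize them instead.
        cc.setMaxConcurrentCheckpoints(2);

        // RocksDB state backend, snapshotting to a DFS path (placeholder URI).
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints");
        // Predefined option set aimed at flash storage.
        backend.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED);
        env.setStateBackend(backend);

        // ... job graph and env.execute() go here ...
    }
}

If I understand it correctly, setMaxConcurrentCheckpoints is the closest
thing to the "pipelining" I asked about in question 1.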

I'm still a little concerned that if it takes upwards of 10 minutes to
checkpoint 1 TB of state, restoring it will take roughly as long, so the
downtime in case of a failure is at least 10 minutes, which is hard to
work around.
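
The only recovery-related knob I'm aware of is the restart strategy,
which bounds how quickly Flink retries but does nothing about the state
re-download itself (placeholder values, minimal sketch):

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Retry up to 3 times, waiting 30 seconds between attempts (placeholder
        // values). This bounds the restart delay, but the state restore from the
        // checkpoint store still dominates the total downtime.
        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(3, 30_000L));
        // ... rest of the job ...
    }
}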

On 11/21/2016 07:47 AM, Stephan Ewen wrote:
Some background on the incremental checkpointing: it is not in the
system yet, but we have a quite advanced design, and some
committers/contributors are currently starting the effort.

That sounds good. Can you link to any publicly available design docs and
code PRs/branches? I'm pretty sure I came across them before, but my
searching is failing me at the moment.

--Steven
