Some background on incremental checkpointing: it is not in the system yet, but we have a quite advanced design, and some committers/contributors are currently starting the effort.
My personal estimate is that it would be available in some months (Q1 next year).

Best,
Stephan

On Sat, Nov 19, 2016 at 4:07 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi Steven,
>
> As Robert said, some of our jobs have state sizes around a TB or more. We
> use the RocksDB state backend with some configs tuned to perform well on
> SSDs (you can get some tips here:
> https://www.youtube.com/watch?v=pvUqbIeoPzM).
>
> We checkpoint our state to Ceph (similar to HDFS, but this is what we have
> :)), and it takes 15-25 minutes for the larger jobs to perform the
> checkpoints/restores. As this runs async in the background it doesn't hurt
> our runtime performance; the only problems are with the strain on the
> network, especially when many jobs are restored at the same time.
>
> Incremental checkpoints would definitely be crazy useful in our case, as
> only a very small percentage of our state is updated between snapshots,
> but it is still feasible as it is for now.
>
> Let me know if I can help with any details.
>
> Cheers,
> Gyula
>
> Robert Metzger <rmetz...@apache.org> wrote (on Sat, Nov 19, 2016, 13:16):
>
>> Hi Steven,
>>
>> According to this presentation, King.com is using Flink with terabytes of
>> state:
>> http://flink-forward.org/wp-content/uploads/2016/07/Gyulo-Fo%CC%81ra-RBEA-Scalable-Real-Time-Analytics-at-King.compressed.pdf
>> (see page 4 specifically)
>>
>> For the 90 GB experiment, what is the expected time for transferring 90 GB
>> of data in your environment?
>>
>> Regards,
>> Robert
>>
>> On Sat, Nov 19, 2016 at 1:41 AM, Steven Ruppert <ste...@fullcontact.com>
>> wrote:
>>
>> Hi,
>>
>> Is anybody currently running Flink streaming with north of a terabyte
>> (TB) of managed state? If you are, can you share your experiences wrt
>> hardware, tuning, recovery situations, etc?
>>
>> I'm evaluating Flink for a use case I estimate will take around 5 TB of
>> state in total, but looking at the actual implementation of the
>> RocksDB state and the current lack of incremental checkpointing or
>> recovery, it doesn't seem feasible.
>>
>> I have successfully tested Flink up to roughly 90 GB of managed state
>> in RocksDB, but that takes 5 minutes to checkpoint or recover (on a
>> pretty beefy YARN cluster).
>>
>> For most cases, my state updates are idempotent and can be moved to
>> something external. However, it'd be nice to know of any current or
>> future plans for running Flink at the terabyte scale.
>>
>> --Steven
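[Editorial note: the setup Gyula describes (RocksDB state backend, checkpoints written to a distributed filesystem) can be sketched as a `flink-conf.yaml` fragment. This is a minimal sketch using the Flink 1.x-era configuration keys; the checkpoint path is a placeholder, and RocksDB SSD tuning itself was done in code rather than here.]

```yaml
# Sketch of a RocksDB-backed setup, assuming Flink 1.x-era config keys.
# The checkpoint directory below is a placeholder path.
state.backend: rocksdb
# Any HDFS-compatible filesystem works here (HDFS, or Ceph via its
# HDFS-compatible interface, as in Gyula's setup):
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints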
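[Editorial note: Robert's question about the expected transfer time for 90 GB comes down to back-of-envelope arithmetic. The sketch below computes it for two illustrative link speeds; these speeds are assumptions for illustration, not numbers from the thread.]

```python
# Back-of-envelope transfer time for checkpoint/restore traffic.
# Link speeds here are illustrative assumptions.

def transfer_time_seconds(size_gb: float, link_gbit_per_s: float) -> float:
    """Time to move `size_gb` gigabytes over a `link_gbit_per_s` Gbit/s link."""
    size_bits = size_gb * 8 * 1e9  # GB -> bits (decimal units)
    return size_bits / (link_gbit_per_s * 1e9)

for gbit in (1, 10):
    t = transfer_time_seconds(90, gbit)
    print(f"90 GB over {gbit} Gbit/s: {t:.0f} s ({t / 60:.1f} min)")
```

At a full 1 Gbit/s the raw transfer alone takes 12 minutes, so a 5-minute checkpoint of 90 GB on a multi-node YARN cluster (where the transfer is parallelized across task managers) is not obviously unreasonable.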