The URL in my previous mail was wrong. It should be:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery

Best,
Paul Lam

> On April 18, 2019, at 18:04, Paul Lam <paullin3...@gmail.com> wrote:
>
> Hi,
>
> Have you tried task local recovery [1]?
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
>
> Best,
> Paul Lam
>
>> On April 17, 2019, at 17:46, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote:
>>
>> Hi Flinkers,
>>
>> Operating different Flink jobs, I've discovered that restarting a job
>> with a pretty large state (in my case up to 100GB+) takes quite a lot
>> of time. For example, to restart a job (e.g. to update it) a savepoint
>> is created, and in the case of savepoints all the state seems to be
>> pushed into the distributed store (HDFS in my case) when stopping the
>> job and pulled back when starting the new version of the job.
>>
>> What I've found so far while trying to speed up job restarts is:
>> - using external retained checkpoints [1]; the drawback is that the
>>   job cannot be rescaled during restart
>> - using external state and storage with stateless jobs; the drawback
>>   is the need for additional network hops to this storage.
>>
>> So I'm wondering whether there are any best practices the community
>> knows and uses to cope with cases like this?
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints