The URL in my previous mail was wrong; it should be:

https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery
 
<https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery>

Best,
Paul Lam

> On 18 Apr 2019, at 18:04, Paul Lam <paullin3...@gmail.com> wrote:
> 
> Hi,
> 
> Have you tried task local recovery [1]?
> 
> [1] 
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
> 
> Best,
> Paul Lam
> 
>> On 17 Apr 2019, at 17:46, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote:
>> 
>> Hi Flinkers,
>> 
>> Operating various Flink jobs, I've discovered that restarting a job
>> with a fairly large state (up to 100GB+ in my case) takes quite a
>> lot of time. For example, to restart a job (e.g. to update it), a
>> savepoint is created, and with savepoints all of the state seems to
>> be pushed to the distributed store (HDFS in my case) when stopping
>> the job and pulled back when starting the new version of the job.
>> 
>> What I've found so far while trying to speed up job restarts:
>> - using externalized retained checkpoints [1]; the drawback is that
>> the job cannot be rescaled during the restart;
>> - using external state and storage with stateless jobs; the drawback
>> is the extra network hops to this storage.
>> 
>> So I'm wondering whether there are any best practices the community
>> knows of and uses to cope with cases like this.
>> 
>> [1] 
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
> 
