Re: Fast restart of a job with a large state

Sergey Zhemzhitsky Wed, 24 Apr 2019 10:28:16 -0700

Hi Till,

Thanks for the info!
It's good to know.


Regards,
Sergey


On Wed, Apr 24, 2019, 13:08 Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Sergey,
>
> at the moment neither local nor incremental savepoints are supported in
> Flink afaik. There were some ideas wrt incremental savepoints floating
> around in the community but nothing concrete yet.
>
> Cheers,
> Till
>
> On Tue, Apr 23, 2019 at 6:58 PM Sergey Zhemzhitsky <szh.s...@gmail.com>
> wrote:
>
>> Hi Stefan, Paul,
>>
>> Thanks for the tips! So far I have tried neither rescaling from
>> checkpoints nor task-local recovery. Both are now on my list to test.
>>
>> If it becomes necessary not just to rescale a job but also to change
>> its DAG, is there a way to have something like, let's call it, "local
>> savepoints" or "incremental savepoints", to avoid transferring the whole
>> state to and from distributed storage?
>>
>> Kind Regards,
>> Sergey
>>
>>
>> On Thu, Apr 18, 2019, 13:22 Stefan Richter <s.rich...@ververica.com>
>> wrote:
>>
>>> Hi,
>>>
>>> If rescaling is the problem, let me clarify that you can currently
>>> rescale from savepoints and all types of checkpoints (including
>>> incremental). If that was the only problem, then there is nothing to worry
>>> about - the documentation is only a bit conservative about this because we
>>> will not commit to an API guarantee that all future types of checkpoints
>>> will be rescalable. But currently they all are, and this is also very
>>> unlikely to change anytime soon.
>>>
>>> Paul, just to comment on your suggestion as well, local recovery would
>>> only help with failover. 1) It does not help for restarts by the user and
>>> 2) it also does not work for rescaling (point 2 is a consequence of
>>> point 1, because failover never rescales, only restarts).
>>>
>>> Best,
>>> Stefan
>>>
>>> On 18. Apr 2019, at 12:07, Paul Lam <paullin3...@gmail.com> wrote:
>>>
>>> The URL in my previous mail is wrong, and it should be:
>>>
>>>
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery
>>>
>>> Best,
>>> Paul Lam
>>>
>>> On Apr 18, 2019, at 18:04, Paul Lam <paullin3...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Have you tried task local recovery [1]?
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
>>>
>>> Best,
>>> Paul Lam
>>>
>>> On Apr 17, 2019, at 17:46, Sergey Zhemzhitsky <szh.s...@gmail.com> wrote:
>>>
>>> Hi Flinkers,
>>>
>>> While operating various Flink jobs, I've found that restarting a job
>>> with a fairly large state (100GB+ in my case) takes quite a lot of
>>> time. For example, to restart a job (e.g. to update it), a savepoint
>>> is created, and with savepoints all the state seems to be pushed to
>>> the distributed store (HDFS in my case) when stopping the job and
>>> pulled back when starting the new version of the job.
>>>
>>> What I've found so far while trying to speed up job restarts:
>>> - using externalized retained checkpoints [1]; the drawback is that
>>> the job cannot be rescaled during restart
>>> - keeping the state in external storage and making the jobs stateless;
>>> the drawback is the additional network hops to this storage.
>>>
>>> So I'm wondering whether there are any best practices the community
>>> knows and uses to cope with cases like this?
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
>>>
>>>
>>>
>>>
>>>
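For readers skimming this thread, the options discussed above can be summarized as a configuration sketch. It is a minimal, non-authoritative example that assumes a Flink 1.8-era flink-conf.yaml (the release line current when this thread was written) and the RocksDB state backend. The HDFS paths are hypothetical placeholders, and key names may differ in other Flink versions.

```yaml
# Sketch only; assumes Flink 1.8-era key names. Paths are hypothetical.

# Incremental checkpoints need the RocksDB state backend.
state.backend: rocksdb
state.backend.incremental: true

# Checkpoints and savepoints go to distributed storage (HDFS in this thread).
state.checkpoints.dir: hdfs:///flink/checkpoints
state.savepoints.dir: hdfs:///flink/savepoints

# Task-local recovery keeps a secondary local copy of the state.
# Per Stefan's reply, this speeds up failover only, not user-initiated
# restarts or rescaling.
state.backend.local-recovery: true
```

Retaining checkpoints after cancellation is typically enabled in the job's own checkpoint configuration rather than in this file. A job can then be resumed from a retained checkpoint much like from a savepoint, by passing the checkpoint path to "flink run -s", optionally with a new parallelism via "-p". As Till notes, savepoints themselves remain full snapshots written to distributed storage.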
