Hi,

Savepoints internally work in exactly the same way as checkpoints.

I'm not sure what you are referring to as a repartitioning step. If you
mean rescaling (changing the parallelism), then this can happen for both
checkpoints and savepoints. You can find answers to a lot of such
questions with a quick search; about rescaling you can read here [1].
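For the record, rescaling on restore is just a matter of resubmitting the job with a different parallelism. A rough sketch with the Flink CLI (the job id, savepoint path, and jar name below are placeholders; the commands are echoed rather than executed, since they need a running cluster):

```shell
# Placeholders -- substitute your own job id, savepoint location, and jar.
JOB_ID="a1b2c3d4e5f6"
SAVEPOINT_DIR="s3://my-bucket/savepoints"
NEW_PARALLELISM=8

# 1. Stop the job while taking a savepoint (command printed only):
echo flink stop --savepointPath "$SAVEPOINT_DIR" "$JOB_ID"

# 2. Resubmit from the savepoint with the new parallelism
#    ("savepoint-xxxx" stands in for the concrete savepoint directory):
echo flink run -s "$SAVEPOINT_DIR/savepoint-xxxx" -p "$NEW_PARALLELISM" my-job.jar
```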

Piotrek
[1] https://flink.apache.org/features/2017/07/04/flink-rescalable-state.html
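P.S. Regarding the restore times discussed further down in this thread, a quick back-of-envelope calculation shows why a 2 hour timeout is tight. This sketch assumes the ~0.5 TiB savepoint and the 70-100 MiB/s throughput you observed, and ignores parallel downloads across TaskManagers:

```python
# Back-of-envelope restore-time estimate (numbers taken from the thread:
# ~0.5 TiB savepoint, 70-100 MiB/s observed download throughput).
SAVEPOINT_MIB = 0.5 * 1024 * 1024  # 0.5 TiB expressed in MiB

def download_hours(throughput_mib_s: float) -> float:
    """Hours to pull the whole savepoint at a given aggregate throughput."""
    return SAVEPOINT_MIB / throughput_mib_s / 3600

for mib_s in (70, 100, 1024):
    print(f"{mib_s:>5} MiB/s -> {download_hours(mib_s):.2f} h")
```

At the observed 70 MiB/s the download alone already exceeds a 2 hour timeout, so the network/disk throughput bottleneck, not Flink's restore logic, looks like the first thing to fix.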

On Mon, Jan 18, 2021 at 5:58 PM Rex Fenley <r...@remind101.com> wrote:

> This is great information, thank you. It does look like task local
> recovery only works for checkpoints, however we do want to bring down our
> recovery times so this is useful.
>
> I'm wondering what all is involved with savepoints though too. Savepoint
> restoration must have some repartitioning step too I'd imagine which seems
> like it could be fairly involved? Anything else I'm missing?
>
> Thanks!
>
> On Mon, Jan 18, 2021 at 2:49 AM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
>> Hi Rex,
>>
>> Good that you have found the source of your problem and thanks for
>> reporting it back.
>>
>> Regarding your question about the recovery steps (ignoring scheduling and
>> deployment): I think it depends on the state backend used. From your
>> other emails I see you are using RocksDB, so I believe this is the big
>> picture of how it works in the RocksDB case:
>>
>> 1. Relevant state files are loaded from the DFS to local disks of a
>> TaskManager [1].
>> 2. I presume RocksDB needs to load at least some metadata from those
>> local files in order to finish the recovery process (I doubt it, but maybe
>> it also needs to load some actual data).
>> 3. Task can start processing the records.
>>
>> Best,
>> Piotrek
>>
>> [1] This step can be avoided if you are using local recovery
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery
>>
>> On Sat, Jan 16, 2021 at 6:15 AM Rex Fenley <r...@remind101.com> wrote:
>>
>>> After looking through lots of graphs and AWS limits, I've come to the
>>> conclusion that we're hitting limits on our disk writes. I'm guessing this
>>> is backpressuring the entire restore process. I'm still very
>>> curious about all the steps involved in savepoint restoration though!
>>>
>>> On Fri, Jan 15, 2021 at 7:50 PM Rex Fenley <r...@remind101.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a savepoint that's ~0.5 TiB in size. When we try to restore
>>>> from it, we time out because it takes too long (right now checkpoint
>>>> timeouts are set to 2 hours, which is already way above where we want them).
>>>>
>>>> I'm curious if it needs to download the entire savepoint to continue.
>>>> Or, for further education, what are all the operations that take place
>>>> before a job is restored from a savepoint?
>>>>
>>>> Additionally, the network seems to be a big bottleneck. Our network
>>>> should be operating in the GiB/s range per instance, but seems to operate
>>>> between 70-100MiB per second when retrieving a savepoint. Are there any
>>>> constraining factors in Flink's design that would slow down the network
>>>> download of a savepoint this much (from S3)?
>>>>
>>>> Thanks!
>>>>
>>>> --
>>>>
>>>> Rex Fenley  |  Software Engineer - Mobile and Backend
>>>>
>>>>
>>>> Remind.com <https://www.remind.com/> |  BLOG <http://blog.remind.com/>
>>>>  |  FOLLOW US <https://twitter.com/remindhq>  |  LIKE US
>>>> <https://www.facebook.com/remindhq>
>>>>
>>>
>>>
>>>
>>
>
>
