This is great information, thank you. It does look like task-local recovery
only works for checkpoints; however, we do want to bring down our recovery
times, so this is useful.
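
For reference, this is roughly how I understand task-local recovery gets
enabled (a minimal sketch based on the docs you linked; the directory below is
just an example, not what we actually run):

    # flink-conf.yaml
    state.backend.local-recovery: true
    # local directories on each TaskManager holding the secondary state copies
    taskmanager.state.local.root-dirs: /mnt/flink/local-state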

I'm also wondering what's involved with savepoints. I'd imagine savepoint
restoration must have some repartitioning step as well, which seems like it
could be fairly involved? Anything else I'm missing?
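
For context, this is the flow I'm picturing (a rough sketch; the job id, paths,
parallelism, and jar name below are placeholders, not our real values):

    # trigger a savepoint for the running job
    ./bin/flink savepoint <jobId> s3://<bucket>/savepoints/

    # resubmit from that savepoint, possibly with a different parallelism,
    # which is where I'd expect the key-group repartitioning to kick in
    ./bin/flink run -s s3://<bucket>/savepoints/savepoint-<id> -p 64 our-job.jar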

Thanks!

On Mon, Jan 18, 2021 at 2:49 AM Piotr Nowojski <pnowoj...@apache.org> wrote:

> Hi Rex,
>
> Good that you have found the source of your problem and thanks for
> reporting it back.
>
> Regarding your question about the recovery steps (ignoring scheduling and
> deploying): I think it depends on the state backend used. From your
> other emails I see you are using RocksDB, so I believe this is the big
> picture of how it works in the RocksDB case:
>
> 1. The relevant state files are loaded from the DFS to the local disks of a
> TaskManager [1].
> 2. I presume RocksDB needs to load at least some metadata from those
> local files in order to finish the recovery process (I doubt it, but maybe
> it also needs to load some actual data).
> 3. The task can start processing records.
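>
> For illustration, a minimal configuration sketch of the setup I have in mind
> (assuming the RocksDB state backend with incremental checkpoints; the bucket
> path is a placeholder):
>
>     state.backend: rocksdb
>     state.backend.incremental: true
>     state.checkpoints.dir: s3://<bucket>/checkpoints
>     # with this enabled, step 1 above can be avoided (see [1])
>     state.backend.local-recovery: true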
>
> Best,
> Piotrek
>
> [1] This step can be avoided if you are using local recovery
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery
>
> On Sat, Jan 16, 2021 at 06:15 Rex Fenley <r...@remind101.com> wrote:
>
>> After looking through lots of graphs and AWS limits, I've come to the
>> conclusion that we're hitting limits on our disk writes. I'm guessing this
>> is backpressuring the entire restore process. I'm still very
>> curious about all the steps involved in savepoint restoration though!
>>
>> On Fri, Jan 15, 2021 at 7:50 PM Rex Fenley <r...@remind101.com> wrote:
>>
>>> Hello,
>>>
>>> We have a savepoint that's ~0.5 TiB in size. When we try to restore from
>>> it, we time out because it takes too long (right now checkpoint timeouts
>>> are set to 2 hours, which is already way above where we want them).
>>>
>>> I'm curious if it needs to download the entire savepoint to continue.
>>> Or, for further education, what are all the operations that take place
>>> before a job is restored from a savepoint?
>>>
>>> Additionally, the network seems to be a big bottleneck. Our network
>>> should be operating in the GiB/s range per instance, but seems to operate
>>> at only 70-100 MiB/s when retrieving a savepoint. Are there any
>>> constraining factors in Flink's design that would slow down the network
>>> download of a savepoint this much (from S3)?
>>>
>>> Thanks!
>>>
>>>
>>
>>
>>
>

-- 

Rex Fenley  |  Software Engineer - Mobile and Backend


Remind.com <https://www.remind.com/>  |  BLOG <http://blog.remind.com/>  |
FOLLOW US <https://twitter.com/remindhq>  |  LIKE US <https://www.facebook.com/remindhq>
