[ https://issues.apache.org/jira/browse/FLINK-24086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17408273#comment-17408273 ]
Dawid Wysakowicz edited comment on FLINK-24086 at 9/9/21, 1:00 PM:
-------------------------------------------------------------------

{quote}We implemented a new failover strategy (by discarding some data to only restart failed tasks)
{quote}
Um... This can be ignored. You can treat it as a full restart of the job, restored from the checkpoint.
{quote}But now we don't restore CompleteCheckpointStore again, this problem will no longer exist
{quote}
According to FLINK-22483, we no longer recover the {{CompletedCheckpointStore}} every time. Therefore, if we reuse the same {{SharedStateRegistry}} during restore and do not clear it, asynchronous deletion will not cause the reference count of {{SharedState}} to drop below 1. So, this can reduce the recovery time.

was (Author: ming li):
{quote}We implemented a new failover strategy (by discarding some data to only restart failed tasks)
{quote}
Um... This can be ignored. You can treat it as a full restart of the job, restored from the checkpoint.
{quote}But now we don't restore CompleteCheckpointStore again, this problem will no longer exist
{quote}
According to FLINK-22483, we no longer recover the {{CompletedCheckpointStore}} every time. Therefore, if we reuse the same{{ SharedStateRegistry}} during restore and do not clear it, asynchronous deletion will not cause the reference count of {{SharedState}} to drop below 1. So, this can reduce the recovery time.
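
To make the reference-counting argument concrete, here is a minimal sketch of a reference-counted registry. It is not Flink's {{SharedStateRegistry}}: the class {{SimpleSharedStateRegistry}}, its methods, and the {{sst-0001}} state ID are all hypothetical names chosen for illustration. It only shows why a registry that is kept across a task restart, rather than cleared and rebuilt, still holds a correct count when a checkpoint is discarded afterwards.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified stand-in for a shared-state registry; not Flink's actual class. */
public class SimpleSharedStateRegistry {
    // Reference count per shared state handle, keyed by a hypothetical string ID.
    private final Map<String, Integer> refCounts = new HashMap<>();

    /** Called when a checkpoint referencing the state is registered. Returns the new count. */
    public int register(String stateId) {
        return refCounts.merge(stateId, 1, Integer::sum);
    }

    /** Called when a checkpoint is discarded. Returns true if the state may now be deleted. */
    public boolean unregister(String stateId) {
        Integer count = refCounts.get(stateId);
        if (count == null) {
            throw new IllegalStateException("Unknown shared state: " + stateId);
        }
        if (count == 1) {
            refCounts.remove(stateId);
            return true;
        }
        refCounts.put(stateId, count - 1);
        return false;
    }

    /** Current number of checkpoints referencing the state (0 if unknown). */
    public int referenceCount(String stateId) {
        return refCounts.getOrDefault(stateId, 0);
    }

    public static void main(String[] args) {
        SimpleSharedStateRegistry registry = new SimpleSharedStateRegistry();
        // Checkpoints 1 and 2 both reference the same shared file.
        registry.register("sst-0001");
        registry.register("sst-0001");
        // A task restart reuses the registry: no clear, no re-registration pass.
        // Discarding checkpoint 1 afterwards still leaves one reference.
        boolean deletable = registry.unregister("sst-0001");
        System.out.println(deletable + " " + registry.referenceCount("sst-0001")); // prints "false 1"
    }
}
```

In this sketch the restart costs nothing because the counts are simply left in place, which is the saving the issue is after. The real registry also has to deal with concurrency and asynchronous deletion, which are left out here.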
> Do not re-register SharedStateRegistry to reduce the recovery time of the job
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-24086
>                 URL: https://issues.apache.org/jira/browse/FLINK-24086
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.14.0
>            Reporter: ming li
>            Assignee: ming li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> At present, we only recover the {{CompletedCheckpointStore}} when the
> {{JobManager}} starts, so it seems that we do not need to re-register the
> {{SharedStateRegistry}} when a task restarts.
> The motivation for this issue is that in our production environment we discard
> part of the data and state so that only the failed task is restarted, but we
> found that registering the {{SharedStateRegistry}} may take several seconds
> (thousands of tasks and dozens of TB of state). When a large number of tasks
> fail at the same time, this may take several minutes (number of tasks *
> several seconds).
> Therefore, if the {{SharedStateRegistry}} can be reused, the time for task
> recovery can be reduced.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)