[ https://issues.apache.org/jira/browse/FLINK-14971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125989#comment-17125989 ]
Stephan Ewen edited comment on FLINK-14971 at 6/4/20, 3:02 PM:
---------------------------------------------------------------

In the second step I was referring to "committing asynchronously", because that is also a blocking operation (write to ZooKeeper). However, committing asynchronously is complex because there is a window during which the scheduler can ask for a checkpoint but it is not clear which one is the latest (due to async committing). I think we can only approach this once we have support for async restore in the scheduler.

Concerning the cleanup problem: This should happen asynchronously (not block the JM and not block committing), but it needs to backpressure new checkpoint creation. It sounds to me like the best way would be to take this into account when triggering checkpoints, as an additional condition. For example, under default settings, a new checkpoint can only be triggered if no other periodic checkpoint is in progress and no more than one checkpoint is pending cleanup.

was (Author: stephanewen):
    In the second step I was referring to "committing asynchronously", because that is also a blocking operation (write to ZooKeeper). However, committing asynchronously is complex because there is a window during which the scheduler can ask for a checkpoint but it is not clear which one is the latest (due to async committing).

Concerning the cleanup problem: This should happen asynchronously (not block the JM and not block committing), but it needs to backpressure new checkpoint creation. It sounds to me like the best way would be to take this into account when triggering checkpoints, as an additional condition. For example, under default settings, a new checkpoint can only be triggered if no other periodic checkpoint is in progress and no more than one checkpoint is pending cleanup.
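
As a rough illustration of the trigger condition proposed in the comment, here is a minimal sketch. It is not taken from Flink's CheckpointCoordinator: the class CheckpointTriggerGate, its method names, and the two limits are hypothetical and only show how "no other checkpoint in progress" and "no more than one checkpoint pending cleanup" could be combined into a single check.

```java
/**
 * Hypothetical sketch (not Flink code): gates checkpoint triggering on
 * both in-flight checkpoints and checkpoints still awaiting cleanup.
 */
final class CheckpointTriggerGate {
    private final int maxInProgress;      // assumed default: 1
    private final int maxPendingCleanup;  // assumed default: 1
    private int inProgress;
    private int pendingCleanup;

    CheckpointTriggerGate(int maxInProgress, int maxPendingCleanup) {
        this.maxInProgress = maxInProgress;
        this.maxPendingCleanup = maxPendingCleanup;
    }

    /** Returns true and records a new in-flight checkpoint if both conditions hold. */
    synchronized boolean tryTrigger() {
        boolean allowed = inProgress < maxInProgress
                && pendingCleanup <= maxPendingCleanup;
        if (allowed) {
            inProgress++;
        }
        return allowed;
    }

    /**
     * Called when an in-flight checkpoint completes. If it subsumes an
     * older checkpoint, that older one is queued for asynchronous cleanup.
     */
    synchronized void onCheckpointCompleted(boolean subsumesOlder) {
        inProgress--;
        if (subsumesOlder) {
            pendingCleanup++;
        }
    }

    /** Called by the asynchronous cleanup task once one checkpoint is discarded. */
    synchronized void onCleanupFinished() {
        pendingCleanup--;
    }
}
```

With both limits set to 1, cleanup never blocks committing, but a slow cleanup eventually makes tryTrigger() return false, which is the backpressure effect described above.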

> Make all the non-IO operations in CheckpointCoordinator single-threaded
> -----------------------------------------------------------------------
>
>                 Key: FLINK-14971
>                 URL: https://issues.apache.org/jira/browse/FLINK-14971
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Checkpointing
>            Reporter: Biao Liu
>            Assignee: Biao Liu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently the handling of ACK and declined messages is executed in the IO thread.
> This is the only remaining place where non-IO operations are executed in the IO thread.
> It blocks introducing the main thread executor for {{CheckpointCoordinator}}. It
> will be resolved in this task.
> After resolving the ACK and declined message issue, the main thread executor
> will be introduced into {{CheckpointCoordinator}} to replace the timer
> thread. However, the timer thread will be kept (perhaps only
> temporarily) to schedule periodic triggering, since FLINK-13848 has not been
> accepted yet.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)