1996fanrui opened a new pull request, #19723: URL: https://github.com/apache/flink/pull/19723
## What is the purpose of the change

Support switching an upstream task from Aligned Checkpoint (AC) to Unaligned Checkpoint (UC), to improve UC when back pressure is severe. When the barrier cannot be sent from the output buffer to the downstream task within the [execution.checkpointing.aligned-checkpoint-timeout](https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout), the upstream task switches to UC and takes a snapshot of the data before the barrier in the output buffer.

## Brief change log

- *When broadcasting a barrier, register a timer if `checkpointOptions.isTimeoutable()`.*
  - *The timer fires at: CP triggerTime + alignmentTimeout.*
  - *When the timer fires: snapshot all buffers before the barrier that have not yet been sent downstream.*
- *If UC is enabled and an aligned barrier is received, the CP cannot succeed immediately and needs to wait for the outputBufferFuture:*
  - *When all barriers are sent downstream quickly, an empty output buffer is returned and the CP ends (when the back pressure isn't severe).*
  - *Otherwise, wait for the timer to fire, snapshot the output buffers before the barrier, and the CP ends (when the back pressure is severe).*
- *If the CP is aborted, Flink executes `channelStateFuture.completeExceptionally(cause)`.*

## Verifying this change

This change added tests and can be verified as follows:

- Unit tests will be finished later.
- This PR will affect the Unaligned checkpoint benchmark. The benefit can be viewed after merge [here](http://codespeed.dak8s.net:8000/timeline/#/?exe=1,6&ben=checkpointSingleInput.UNALIGNED_1&env=2&revs=200&equid=off&quarts=on&extr=on).
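
For reviewers, the timer behavior described in the brief change log can be sketched roughly as follows. This is a simplified, hypothetical model and not the code in this PR: the class `AlignedTimeoutSketch` and all of its method names are illustrative, buffers are modeled as plain strings, the timer is simulated by calling `onTimer()` directly, and a single future stands in for both `outputBufferFuture` and `channelStateFuture`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CompletableFuture;

/** Hypothetical model of the upstream AC-to-UC switch; names are NOT Flink's real classes. */
class AlignedTimeoutSketch {
    // Buffers queued ahead of the barrier in the output buffer (modeled as strings).
    private final Deque<String> pendingBeforeBarrier = new ArrayDeque<>();
    // Completes with the buffers that must be persisted as channel state.
    private final CompletableFuture<List<String>> outputBufferFuture = new CompletableFuture<>();

    /** Delay until the timer fires: (CP triggerTime + alignmentTimeout) - now, never negative. */
    static long timerDelayMillis(long triggerTime, long alignmentTimeout, long now) {
        return Math.max(0L, triggerTime + alignmentTimeout - now);
    }

    void enqueue(String buffer) {
        pendingBeforeBarrier.add(buffer);
    }

    /** Downstream consumed one buffer; once nothing is left ahead of the barrier, finish empty. */
    void onBufferSent() {
        pendingBeforeBarrier.poll();
        if (pendingBeforeBarrier.isEmpty()) {
            // No effect if the timer has already completed the future.
            outputBufferFuture.complete(new ArrayList<>());
        }
    }

    /** Timer fired: snapshot everything still queued ahead of the barrier. */
    void onTimer() {
        // No effect if the barrier has already been sent downstream.
        outputBufferFuture.complete(new ArrayList<>(pendingBeforeBarrier));
    }

    /** Checkpoint aborted: propagate the failure to whoever waits on the future. */
    void abort(Throwable cause) {
        outputBufferFuture.completeExceptionally(cause);
    }

    CompletableFuture<List<String>> future() {
        return outputBufferFuture;
    }

    public static void main(String[] args) {
        // Mild back pressure: the barrier leaves before the timer fires.
        AlignedTimeoutSketch fast = new AlignedTimeoutSketch();
        fast.enqueue("b1");
        fast.onBufferSent();
        fast.onTimer();
        System.out.println("fast path snapshot: " + fast.future().join());

        // Severe back pressure: the timer fires while buffers are still queued.
        AlignedTimeoutSketch slow = new AlignedTimeoutSketch();
        slow.enqueue("b1");
        slow.enqueue("b2");
        slow.enqueue("b3");
        slow.onBufferSent();
        slow.onTimer();
        System.out.println("slow path snapshot: " + slow.future().join());
    }
}
```

Whichever of the two events happens first decides the result, because `CompletableFuture.complete` is a no-op once the future is already completed. A real implementation would also have to deal with threading and actual timer registration, which this sketch leaves out.
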
## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
- The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no

## Documentation

- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? docs; some of the UC documentation needs to be updated.