[ https://issues.apache.org/jira/browse/FLINK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534325#comment-17534325 ]
Piotr Nowojski commented on FLINK-27251:
----------------------------------------

Thanks for raising the issue [~fanrui]. Yes, this is a known problem.

While developing unaligned checkpoints, and especially when adding timeout support, the timeouts proved very difficult to implement, causing lots of critical bugs and requiring a lot of effort to debug data corruption and stabilise the feature. All in all, in retrospect, our feeling was that adding the timeouts was not worth the effort and that users should be just as fine using unaligned checkpoints without any timeout. At one point I was even thinking about removing the feature altogether in order to simplify the code base.

The main issue with the motivation is that without backpressure, unaligned checkpoints will capture only a negligible amount of in-flight data, and with backpressure you most likely want fully unaligned checkpoints anyway, so we don't see a clear benefit in enabling the timeout in the first place. From this perspective, I would like to first discuss whether we even need this feature.

Secondly, assuming that we really need it, one would have to think very carefully about how to implement it. Note that if you exceed the time limit on the upstream subtask's output for sending aligned barriers, then by the time you want to convert those barriers to an unaligned checkpoint, this subtask has already completed the checkpoint, while the timeout process would have to append the output in-flight data to that checkpoint.
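For readers following the thread, the switching condition quoted in the issue description ({_}currentTime - triggerTime > timeout{_}) can be sketched as below. This is only an illustration of the comparison, not Flink's actual implementation: the class name `AlignmentTimeoutSketch`, the method `shouldSwitchToUnaligned`, and the millisecond values are all hypothetical.

```java
import java.time.Duration;

public class AlignmentTimeoutSketch {

    /**
     * Hypothetical helper (not a Flink API): returns true once the time
     * elapsed since the checkpoint was triggered exceeds the configured
     * aligned-checkpoint timeout.
     */
    static boolean shouldSwitchToUnaligned(long currentTimeMs, long triggerTimeMs, Duration timeout) {
        return currentTimeMs - triggerTimeMs > timeout.toMillis();
    }

    public static void main(String[] args) {
        // Assumed value standing in for execution.checkpointing.aligned-checkpoint-timeout.
        Duration timeout = Duration.ofSeconds(30);
        long triggerTimeMs = 1_000L;

        // 20 s after the trigger: still within the timeout, stay aligned.
        System.out.println(shouldSwitchToUnaligned(21_000L, triggerTimeMs, timeout));
        // 40 s after the trigger: timeout exceeded, switch to unaligned.
        System.out.println(shouldSwitchToUnaligned(41_000L, triggerTimeMs, timeout));
    }
}
```

The sketch covers only the timing check. It says nothing about the harder part raised above, namely that an upstream subtask may already have completed its part of the checkpoint by the time the check fires.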
> Solve the problem that upstream Task cannot be switched to Unaligned Checkpoint
> -------------------------------------------------------------------------------
>
> Key: FLINK-27251
> URL: https://issues.apache.org/jira/browse/FLINK-27251
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Affects Versions: 1.14.0, 1.15.0
> Reporter: fanrui
> Priority: Major
> Fix For: 1.16.0
>
>
> After FLINK-23041, the downstream task can be switched to UC when {_}currentTime - triggerTime > timeout{_}. But the downstream task still needs to wait for all barriers from upstream.
> If the back pressure is severe, the downstream task cannot receive all barriers within the CP timeout, which causes the CP to fail.
>
> Can we support the upstream Task switching from Aligned to UC? It means that when the barrier cannot be sent from the output buffer to the downstream task within the [execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout], the upstream task switches to UC and takes a snapshot of the data before the barrier in the output buffer.
>
> Hi [~akalashnikov], please help take a look in your free time, thanks a lot.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)