[ 
https://issues.apache.org/jira/browse/FLINK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534325#comment-17534325
 ] 

Piotr Nowojski commented on FLINK-27251:
----------------------------------------

Thanks for raising the issue [~fanrui]. Yes, this is a known problem. While 
developing unaligned checkpoints, and especially when adding timeout 
support, the timeouts proved very difficult to implement, causing lots of 
critical bugs and requiring a lot of effort to debug data corruption and 
stabilise the feature. All in all, in retrospect, our feeling was that adding 
the timeouts was not worth the effort and that users would be just as well off 
using unaligned checkpoints without any timeout. At one point I was even 
thinking about removing the feature altogether in order to simplify the code base.

The main issue with the motivation is that without backpressure, unaligned 
checkpoints will capture only a negligible amount of in-flight data, and with 
backpressure you most likely want fully unaligned checkpoints anyway, 
so we don't actually see a clear benefit of enabling the timeout in the first 
place. From this perspective, I would first like to discuss whether we even 
need this feature. 

Secondly, assuming that we really need it, one would have to think very 
carefully about how to implement it. Note that if the time limit for sending 
aligned barriers is exceeded on the upstream subtask's output, then by the 
time you want to convert those barriers to unaligned ones, this subtask has 
already completed the checkpoint. The timeout handling, however, would still 
have to append the output in-flight data to that checkpoint.
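
To make the ordering problem concrete, here is a toy Java model of a single 
upstream subtask's output queue. It is only a sketch, not Flink code: the 
class, the method names and the string-based "snapshot" are all made up for 
illustration, and measuring the timeout from the moment the barrier is 
enqueued is a simplifying assumption.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy model of one upstream subtask's output queue; hypothetical, not Flink code. */
final class UpstreamOutputSketch {
    /** Hypothetical marker standing in for an aligned checkpoint barrier. */
    static final String BARRIER = "BARRIER";

    private final Deque<String> outputQueue = new ArrayDeque<>();
    private final List<String> snapshot = new ArrayList<>();
    private boolean snapshotCompleted = false;
    private long barrierEnqueuedAtMs = -1;

    /** Buffers a record in the output queue, as happens under backpressure. */
    void emit(String record) {
        outputQueue.addLast(record);
    }

    /** Aligned path: snapshot the state, then queue the barrier behind buffered records. */
    void triggerAlignedCheckpoint(String operatorState, long nowMs) {
        snapshot.add("state:" + operatorState);
        snapshotCompleted = true; // from the subtask's point of view the checkpoint is done
        outputQueue.addLast(BARRIER);
        barrierEnqueuedAtMs = nowMs;
    }

    /** True if the barrier is still stuck in the output queue after the timeout. */
    boolean barrierTimedOut(long nowMs, long timeoutMs) {
        return barrierEnqueuedAtMs >= 0
                && outputQueue.contains(BARRIER)
                && nowMs - barrierEnqueuedAtMs > timeoutMs;
    }

    /** Records queued ahead of the barrier, i.e. the output in-flight data. */
    List<String> inFlightBeforeBarrier() {
        List<String> result = new ArrayList<>();
        for (String element : outputQueue) {
            if (element.equals(BARRIER)) {
                break;
            }
            result.add(element);
        }
        return result;
    }

    /** The problematic step: it modifies a snapshot that was already completed. */
    void switchToUnaligned() {
        for (String record : inFlightBeforeBarrier()) {
            snapshot.add("in-flight:" + record);
        }
        outputQueue.remove(BARRIER);   // take the barrier out of its queued position
        outputQueue.addFirst(BARRIER); // and let it overtake the buffered records
    }

    boolean isSnapshotCompleted() {
        return snapshotCompleted;
    }

    List<String> snapshot() {
        return snapshot;
    }
}
```

In this model, switchToUnaligned() runs only after isSnapshotCompleted() is 
already true, which is exactly the awkward part: the late switch has to 
extend a snapshot that the subtask considers finished.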

> Solve the problem that upstream Task cannot be switched to Unaligned 
> Checkpoint
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-27251
>                 URL: https://issues.apache.org/jira/browse/FLINK-27251
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: fanrui
>            Priority: Major
>             Fix For: 1.16.0
>
>
> After FLINK-23041, the downstream task can be switched to UC when 
> {_}currentTime - triggerTime > timeout{_}. But the downstream task still 
> needs to wait for all barriers from upstream. 
> If the back pressure is severe, the downstream task cannot receive all 
> barriers within the CP timeout, causing the CP to fail.
>  
> Can we support upstream Task switching from Aligned to UC? It means that when 
> the barrier cannot be sent from the output buffer to the downstream task 
> within the 
> [execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout],
>  the upstream task switches to UC and takes a snapshot of the data before the 
> barrier in the output buffer.
>  
> Hi [~akalashnikov] , please help take a look in your free time, thanks a lot.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
