[jira] [Commented] (FLINK-27251) Timeout aligned to unaligned checkpoint barrier in the output buffers of an upstream subtask

fanrui (Jira) Wed, 11 May 2022 06:46:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534911#comment-17534911
 ]


fanrui commented on FLINK-27251:
--------------------------------

Thanks for your comments. 

For comment 1 and 2, you're right. It's just a temporary and non-generic 
version, I try to fix them asap. Or could I submit a PR after I fixed and add 
more unit tests? The PR can add some comments and more public, it's easier to 
discuss.

For comment 3, it does't extend the time for aligned checkpoint when the back 
pressure isn't severe. There are 2 cases to call the outputFuture.complete.
 * When all barriers are sent downstream quickly, and an empty output buffer is 
returned, the CP ends.(when the back pressure isn't severe)  [code 
link|https://github.com/1996fanrui/flink/blob/f9738c05608fc24a15e17ca61c194136a66398e4/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/PipelinedSubpartition.java#L435]
 * Or wait for the timer to trigger, snapshot these output buffers before the 
barrier, and the CP ends.(when the back pressure is severe)  [code 
link|https://github.com/1996fanrui/flink/blob/f9738c05608fc24a15e17ca61c194136a66398e4/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/PipelinedSubpartition.java#L358]

For comment 4, you're right. ChannelStateWriter will wait for the 
outputdataFuture. Actually I want to optimize it. Use something like 
CompletionService to consume completed futures instead of FIFO. Of course, 
finishOutput must be guaranteed to be processed last.

> Timeout aligned to unaligned checkpoint barrier in the output buffers of an 
> upstream subtask
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-27251
>                 URL: https://issues.apache.org/jira/browse/FLINK-27251
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.14.0, 1.15.0
>            Reporter: fanrui
>            Priority: Major
>             Fix For: 1.16.0
>
>
> After FLINK-23041, the downstream task can be switched UC when {_}currentTime 
> - triggerTime > timeout{_}. But the downstream task still needs wait for all 
> barriers of upstream. 
> If the back pressure is serve, the downstream task cannot receive all barrier 
> within CP timeout, causes CP to fail.
>  
> Can we support upstream Task switching from Aligned to UC? It means that when 
> the barrier cannot be sent from the output buffer to the downstream task 
> within the 
> [execution.checkpointing.aligned-checkpoint-timeout|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#execution-checkpointing-aligned-checkpoint-timeout],
>  the upstream task switches to UC and takes a snapshot of the data before the 
> barrier in the output buffer.
>  
> Hi [~akalashnikov] , please help take a look in your free time, thanks a lot.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (FLINK-27251) Timeout aligned to unaligned checkpoint barrier in the output buffers of an upstream subtask

Reply via email to