[ 
https://issues.apache.org/jira/browse/FLINK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17210194#comment-17210194
 ] 

Piotr Nowojski commented on FLINK-18662:
----------------------------------------

This is how the changes will look like in the Web UI

 !checkpoint_monitoring-history-subtasks.png!  
!checkpoint_monitoring-history.png! 

> Provide more detailed metrics why unaligned checkpoint is taking long time
> --------------------------------------------------------------------------
>
>                 Key: FLINK-18662
>                 URL: https://issues.apache.org/jira/browse/FLINK-18662
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics, Runtime / Network
>    Affects Versions: 1.11.1
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>         Attachments: Screenshot 2020-07-21 at 11.50.02.png, 
> checkpoint_monitoring-history-subtasks.png, checkpoint_monitoring-history.png
>
>
> With unaligned checkpoint there can happen situation as in the attached 
> screenshot.
>  Task reports long end to end checkpoint time (~2h50min), ~0s sync time, 
> ~2h50min async time, ~0s start delay. It means that task received first 
> checkpoint barrier from one of the channels very quickly (~0s), sync part was 
> quick, but we do not know why async part was taking so long. It could be 
> because of three things:
> # long operator state IO writes
> # long spilling of in-flight data
> # long time to receive the final checkpoint barrier from the last lagging 
> channel
> First and second are probably indistinguishable and the difference between 
> them doesn't matter much for analyzing. However the last one is quite 
> different. It might be independent of the IO, and we are missing this 
> information. 
> Maybe we could report it as "alignment duration" and while we are at it, we 
> could also report amount of spilled in-flight data for unaligned checkpoints 
> as "alignment buffered"? 
> Ideally we should report it as new metrics, but that leaves a question how to 
> display it in the UI, with limited space available. Maybe it could be 
> reported as:
> ||Alignment Buffered||Alignment Duration||
> |0 B (632 MB)|0ms (2h 49m 32s)|
> Where the values in the parenthesis would come from unaligned checkpoints. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to