[ 
https://issues.apache.org/jira/browse/FLINK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17205396#comment-17205396
 ] 

Piotr Nowojski commented on FLINK-18662:
----------------------------------------

I’ve found an issue with the “processed data during alignment” metric. For 
unaligned checkpoints, the most interesting metrics would be:
1. the amount of in-flight data persisted in the checkpoint (that’s easy to do)
2. something to compare it against: the amount of data processed during the 
time the in-flight data were being persisted

If the amount of processed data in 2. is close to the amount in 1., unaligned 
checkpoints (UC) do not make much sense. But the number that we are looking 
for here is not “processed during alignment” but “processed during checkpoint” 
(including the async phase). “Processed during alignment” would be the amount 
of data processed between the first and the last received checkpoint barrier.

So I'm going to make this adjustment in my PR: instead of the "during 
alignment" metric, it will calculate the new "during checkpoint" metric.
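
The comparison described above could be sketched roughly as follows. This is 
only an illustration, not code from the PR: the class name, the method names 
and the tolerance parameter are all hypothetical, and what counts as "close" 
is an assumption left to the caller.

```java
/** Hypothetical helper for illustration only; not part of Flink. */
final class CheckpointDataComparisonSketch {

    private CheckpointDataComparisonSketch() {}

    /**
     * Ratio of data processed during the checkpoint (including the async
     * phase) to in-flight data persisted in the checkpoint.
     */
    static double processedToPersistedRatio(long processedBytes, long persistedBytes) {
        if (processedBytes < 0 || persistedBytes < 0) {
            throw new IllegalArgumentException("byte counts must be non-negative");
        }
        if (persistedBytes == 0) {
            // Nothing was persisted, so there is no meaningful ratio.
            return processedBytes == 0 ? Double.NaN : Double.POSITIVE_INFINITY;
        }
        return (double) processedBytes / persistedBytes;
    }

    /**
     * True if the two amounts are close, i.e. the ratio is within the given
     * (assumed, caller-chosen) tolerance of 1.0.
     */
    static boolean isCloseToOne(long processedBytes, long persistedBytes, double tolerance) {
        double ratio = processedToPersistedRatio(processedBytes, persistedBytes);
        // NaN and infinity both fail this comparison and yield false.
        return Math.abs(ratio - 1.0) <= tolerance;
    }
}
```

A ratio near 1.0 would correspond to the case where the processed amount in 
2. is close to the persisted amount in 1.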

> Provide more detailed metrics why unaligned checkpoint is taking long time
> --------------------------------------------------------------------------
>
>                 Key: FLINK-18662
>                 URL: https://issues.apache.org/jira/browse/FLINK-18662
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics, Runtime / Network
>    Affects Versions: 1.11.1
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Critical
>             Fix For: 1.12.0
>
>         Attachments: Screenshot 2020-07-21 at 11.50.02.png
>
>
> With unaligned checkpoints, a situation like the one in the attached 
> screenshot can occur.
>  The task reports a long end-to-end checkpoint time (~2h50min), ~0s sync 
> time, ~2h50min async time, and ~0s start delay. This means that the task 
> received the first checkpoint barrier from one of the channels very quickly 
> (~0s) and the sync part was quick, but we do not know why the async part 
> took so long. It could be because of three things:
> # long operator state IO writes
> # long spilling of in-flight data
> # long time to receive the final checkpoint barrier from the last lagging 
> channel
> The first and second are probably indistinguishable, and the difference 
> between them doesn't matter much for the analysis. However, the last one is 
> quite different. It might be independent of the IO, and we are missing this 
> information. 
> Maybe we could report it as "alignment duration" and, while we are at it, we 
> could also report the amount of spilled in-flight data for unaligned 
> checkpoints as "alignment buffered"? 
> Ideally we should report these as new metrics, but that leaves the question 
> of how to display them in the UI, with limited space available. Maybe they 
> could be reported as:
> ||Alignment Buffered||Alignment Duration||
> |0 B (632 MB)|0ms (2h 49m 32s)|
> Where the values in parentheses would come from unaligned checkpoints. 
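
The cell layout proposed in the description above could be produced by 
something like the following sketch. It is only an illustration under stated 
assumptions: the class and method names are hypothetical, this is not how the 
Flink web UI formats values, and byte sizes are truncated to whole units.

```java
/** Hypothetical formatter for illustration only; not part of Flink. */
final class AlignmentCellSketch {

    private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB"};

    private AlignmentCellSketch() {}

    /** Formats a byte count with 1024-based units, truncating the fraction. */
    static String formatBytes(long bytes) {
        long value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        return value + " " + UNITS[unit];
    }

    /** Formats a duration as "Nms" below one second, else as "Hh Mm Ss". */
    static String formatDuration(long millis) {
        if (millis < 1000) {
            return millis + "ms";
        }
        long totalSeconds = millis / 1000;
        long hours = totalSeconds / 3600;
        long minutes = (totalSeconds % 3600) / 60;
        long seconds = totalSeconds % 60;
        StringBuilder sb = new StringBuilder();
        if (hours > 0) {
            sb.append(hours).append("h ");
        }
        if (hours > 0 || minutes > 0) {
            sb.append(minutes).append("m ");
        }
        return sb.append(seconds).append('s').toString();
    }

    /** Combines both values, with the unaligned-checkpoint one in parentheses. */
    static String cell(String alignedValue, String unalignedValue) {
        return alignedValue + " (" + unalignedValue + ")";
    }
}
```

For example, cell(formatBytes(0), formatBytes(632L * 1024 * 1024)) gives 
"0 B (632 MB)", matching the first column of the table above.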



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
