[jira] [Comment Edited] (FLINK-18662) Provide more detailed metrics why unaligned checkpoint is taking long time

Piotr Nowojski (Jira) Mon, 27 Jul 2020 06:41:57 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17165705#comment-17165705
 ]


Piotr Nowojski edited comment on FLINK-18662 at 7/27/20, 1:40 PM:
------------------------------------------------------------------

What about the following proposal?
# Let's drop "buffered during alignment"; it doesn't make much sense right now
(since FLINK-16404).
# "alignment duration" makes sense for both unaligned and aligned checkpoints.
# There might be extra value in splitting "in-flight data during alignment"
into "processed data during alignment" and "persisted data during alignment".
For example, when a task is processing data very quickly, both metrics will be
equal (or close to equal), and that would be a nice signal to the user that
they are wasting resources and should turn off unaligned checkpoints. Those
two metrics could be displayed either in separate columns or in a single one
like this:
||Processed (Persisted) Data||Alignment Duration||
|245 MB (632 MB)|2h 49m 32s|
(for aligned checkpoints, persisted bytes would always be zero)
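
To make the combined column concrete, here is a minimal sketch of how such a cell could be rendered and how the "both metrics are close to equal" signal could be checked. All names here (AlignmentMetricsSketch, suggestsDisablingUnaligned, the tolerance parameter) are hypothetical and chosen for illustration; none of them are existing Flink APIs, and the 1024-based units are an assumption.

```java
import java.time.Duration;

/** Illustrative sketch only: none of these names exist in Flink. */
final class AlignmentMetricsSketch {

    /** Renders a byte count like the example table, e.g. "245 MB" (1024-based units assumed). */
    static String formatBytes(long bytes) {
        String[] units = {"B", "KB", "MB", "GB", "TB"};
        double value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < units.length - 1) {
            value /= 1024;
            unit++;
        }
        return Math.round(value) + " " + units[unit];
    }

    /** Renders a duration like the example table, e.g. "2h 49m 32s"; sub-second values as "0ms". */
    static String formatDuration(Duration d) {
        long seconds = d.getSeconds();
        if (seconds < 1) {
            return d.toMillis() + "ms";
        }
        long h = seconds / 3600;
        long m = (seconds % 3600) / 60;
        long s = seconds % 60;
        StringBuilder sb = new StringBuilder();
        if (h > 0) {
            sb.append(h).append("h ");
        }
        if (h > 0 || m > 0) {
            sb.append(m).append("m ");
        }
        return sb.append(s).append("s").toString();
    }

    /** Single-column variant: "processed (persisted)". */
    static String processedPersistedCell(long processedBytes, long persistedBytes) {
        return formatBytes(processedBytes) + " (" + formatBytes(persistedBytes) + ")";
    }

    /**
     * Assumed heuristic: something was persisted, and the two metrics differ by at most
     * the given relative tolerance. This mirrors the "close to equal" signal from the
     * comment; the tolerance value itself is an arbitrary choice.
     */
    static boolean suggestsDisablingUnaligned(long processedBytes, long persistedBytes, double tolerance) {
        long max = Math.max(processedBytes, persistedBytes);
        return persistedBytes > 0 && Math.abs(processedBytes - persistedBytes) <= tolerance * max;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        Duration alignment = Duration.ofHours(2).plusMinutes(49).plusSeconds(32);
        // Prints the example row in Jira table syntax.
        System.out.println("|" + processedPersistedCell(245 * mb, 632 * mb) + "|" + formatDuration(alignment) + "|");
        System.out.println(suggestsDisablingUnaligned(245 * mb, 632 * mb, 0.1));
    }
}
```

With the example values from the table, the check returns false, because 245 MB and 632 MB are far apart.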

I hope that we could add some tooltips with a more detailed explanation.

When I think about other unaligned checkpoint triggers (timeout/size limit), I
think those three metrics (duration, processed and persisted data) still make
sense and should be enough.


was (Author: pnowojski):
What about the following proposal?
# Let's drop "buffered during alignment"; it doesn't make much sense right now
(since FLINK-16404).
# "alignment duration" makes sense for both unaligned and aligned checkpoints.
# There might be extra value in splitting "in-flight data during alignment"
into data from before the trigger ("processed data during alignment") and data
from after the trigger ("persisted data during alignment"). Before the trigger
we are processing the data; after it we are still processing the data but also
spilling it. For example, when a task is processing data very quickly, both
metrics will be equal (or close to equal), and that would be a nice signal to
the user that they are wasting resources and should turn off unaligned
checkpoints. Those two metrics could be displayed either in separate columns or
in a single one like this:
||Processed (Persisted) Data||Alignment Duration||
|245 MB (632 MB)|2h 49m 32s|
(for aligned checkpoints, persisted bytes would always be zero)

I hope that we could add some tooltips with a more detailed explanation.

When I think about other unaligned checkpoint triggers (timeout/size limit), I
think those three metrics (duration, processed and persisted data) still make
sense and should be enough.

> Provide more detailed metrics why unaligned checkpoint is taking long time
> --------------------------------------------------------------------------
>
>                 Key: FLINK-18662
>                 URL: https://issues.apache.org/jira/browse/FLINK-18662
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics, Runtime / Network
>    Affects Versions: 1.11.1
>            Reporter: Piotr Nowojski
>            Priority: Critical
>             Fix For: 1.12.0
>
>         Attachments: Screenshot 2020-07-21 at 11.50.02.png
>
>
> With unaligned checkpoints, a situation like the one in the attached
> screenshot can occur.
>  The task reports a long end-to-end checkpoint time (~2h50min), ~0s sync
> time, ~2h50min async time, and ~0s start delay. This means the task received
> the first checkpoint barrier from one of the channels very quickly (~0s) and
> the sync part was quick, but we do not know why the async part took so long.
> There are three possible causes:
> # long operator state IO writes
> # long spilling of in-flight data
> # a long wait for the final checkpoint barrier from the last lagging
> channel
> The first and second are probably indistinguishable, and the difference
> between them doesn't matter much for analysis. The last one, however, is
> quite different. It might be independent of the IO, and we are missing this
> information.
> Maybe we could report it as "alignment duration" and, while we are at it,
> also report the amount of spilled in-flight data for unaligned checkpoints
> as "alignment buffered"?
> Ideally we should report these as new metrics, but that leaves the question
> of how to display them in the UI, with limited space available. Maybe they
> could be reported as:
> ||Alignment Buffered||Alignment Duration||
> |0 B (632 MB)|0ms (2h 49m 32s)|
> where the values in parentheses would come from unaligned checkpoints.
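
The "alignment duration" suggested in the description above can be read as the time between the first and the last checkpoint barrier arriving at a task. A minimal sketch under that assumption follows; the class, the method, and the idea of passing one arrival timestamp per input channel are hypothetical and not taken from Flink's code.

```java
import java.util.Arrays;

/** Illustrative sketch only: not a Flink API. */
final class AlignmentDurationSketch {

    /**
     * Assumes one barrier arrival timestamp (in milliseconds) per input channel for a
     * single checkpoint. The result is the wait for the last lagging channel, measured
     * from the arrival of the first barrier.
     */
    static long alignmentDurationMillis(long[] barrierArrivalMillis) {
        if (barrierArrivalMillis.length == 0) {
            throw new IllegalArgumentException("at least one channel is required");
        }
        long first = Arrays.stream(barrierArrivalMillis).min().getAsLong();
        long last = Arrays.stream(barrierArrivalMillis).max().getAsLong();
        return last - first;
    }

    public static void main(String[] args) {
        // Two fast channels and one channel lagging by roughly 2h 49m 32s.
        long[] arrivals = {1_000L, 1_050L, 10_173_000L};
        System.out.println(alignmentDurationMillis(arrivals) + " ms");
    }
}
```

A large value here alongside little spilled data would point at the third cause (a lagging channel) and not at IO.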



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
