Piotr Nowojski created FLINK-18662:
--------------------------------------

             Summary: Provide more detailed metrics why unaligned checkpoint is 
taking long time
                 Key: FLINK-18662
                 URL: https://issues.apache.org/jira/browse/FLINK-18662
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Metrics, Runtime / Network
    Affects Versions: 1.11.1
            Reporter: Piotr Nowojski
             Fix For: 1.12.0
         Attachments: Screenshot 2020-07-21 at 11.50.02.png

With unaligned checkpoints, the situation shown in the attached 
screenshot can occur.

The task reports a long end-to-end checkpoint time (~2h50min), ~0s sync time, 
~2h50min async time, and ~0s start delay. This means that the task received the 
first checkpoint barrier from one of the channels very quickly (~0s) and that the 
sync part was quick, but we do not know why the async part took so long. There 
are three possible causes:
# long operator state IO writes
# long spilling of in-flight data
# long time to receive the final checkpoint barrier from the last lagging 
channel

The first and second are probably indistinguishable, and the difference between 
them doesn't matter much for analysis. The last one, however, is quite different: 
it might be independent of the IO, and we are currently missing this information. 

Maybe we could report it as "alignment duration", and while we are at it, we 
could also report the amount of spilled in-flight data for unaligned checkpoints 
as "alignment buffered"? 
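
To make the two proposed numbers concrete, here is a minimal sketch of how they could be tracked per checkpoint. It assumes that "alignment duration" means the time between the first and the last barrier arriving at the task, and that "alignment buffered" is the sum of spilled in-flight bytes. The class and method names (UnalignedAlignmentTracker, onBarrier, onInFlightDataSpilled) are hypothetical and are not existing Flink APIs.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not a Flink class): collects alignment figures for one checkpoint. */
final class UnalignedAlignmentTracker {
    private final int numberOfChannels;
    private final Set<Integer> channelsWithBarrier = new HashSet<>();
    private long firstBarrierNanos;
    private long lastBarrierNanos;
    private long spilledBytes;

    UnalignedAlignmentTracker(int numberOfChannels) {
        if (numberOfChannels <= 0) {
            throw new IllegalArgumentException("need at least one input channel");
        }
        this.numberOfChannels = numberOfChannels;
    }

    /** Records that the checkpoint barrier arrived on the given channel at the given time. */
    void onBarrier(int channel, long nowNanos) {
        if (!channelsWithBarrier.add(channel)) {
            return; // barrier already seen on this channel
        }
        if (channelsWithBarrier.size() == 1) {
            firstBarrierNanos = nowNanos; // first barrier: alignment starts
        }
        if (channelsWithBarrier.size() == numberOfChannels) {
            lastBarrierNanos = nowNanos; // last lagging channel: alignment ends
        }
    }

    /** Records in-flight data that was spilled as part of this checkpoint. */
    void onInFlightDataSpilled(long bytes) {
        spilledBytes += bytes;
    }

    boolean isComplete() {
        return channelsWithBarrier.size() == numberOfChannels;
    }

    /** Assumed definition: time between the first and the last barrier. */
    long alignmentDurationNanos() {
        if (!isComplete()) {
            throw new IllegalStateException("not all channels have delivered a barrier yet");
        }
        return lastBarrierNanos - firstBarrierNanos;
    }

    long alignmentBufferedBytes() {
        return spilledBytes;
    }
}
```

With these two figures reported separately from the async duration, a long wait for the last lagging channel could be told apart from slow state or spilling IO.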

Ideally we should report these as new metrics, but that leaves the question of 
how to display them in the UI, given the limited space available. Maybe they 
could be reported as:

||Alignment Buffered||Alignment Duration||
|0 B (632 MB)|0ms (2h 49m 32s)|

where the values in parentheses would come from unaligned checkpoints. 
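
As an illustration of the combined cell format, the following sketch renders an aligned value followed by the unaligned value in parentheses. The class AlignmentCellFormatter and its methods are hypothetical. The unit handling (1024-based, truncating) is an assumption and may not match what the Flink web UI actually does.

```java
/** Hypothetical formatter (not part of Flink) for the combined table cells. */
final class AlignmentCellFormatter {
    private static final String[] UNITS = {"B", "KB", "MB", "GB", "TB"};

    /** Formats a byte count with 1024-based units, truncating any fraction. */
    static String formatBytes(long bytes) {
        long value = bytes;
        int unit = 0;
        while (value >= 1024 && unit < UNITS.length - 1) {
            value /= 1024;
            unit++;
        }
        return value + " " + UNITS[unit];
    }

    /** Formats milliseconds as "Nms" below one second, otherwise as "Xh Ym Zs". */
    static String formatDuration(long millis) {
        if (millis < 1000) {
            return millis + "ms";
        }
        long hours = millis / 3_600_000;
        long minutes = (millis / 60_000) % 60;
        long seconds = (millis / 1000) % 60;
        StringBuilder sb = new StringBuilder();
        if (hours > 0) {
            sb.append(hours).append("h ");
        }
        if (hours > 0 || minutes > 0) {
            sb.append(minutes).append("m ");
        }
        return sb.append(seconds).append("s").toString();
    }

    /** Combines the aligned value with the unaligned value in parentheses. */
    static String cell(String alignedValue, String unalignedValue) {
        return alignedValue + " (" + unalignedValue + ")";
    }
}
```

For example, cell(formatBytes(0), formatBytes(632L * 1024 * 1024)) yields "0 B (632 MB)", matching the first column of the table above.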




--
This message was sent by Atlassian Jira
(v8.3.4#803005)
