[ https://issues.apache.org/jira/browse/FLINK-18662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chesnay Schepler updated FLINK-18662:
-------------------------------------
    Component/s: Runtime / Web Frontend

> Provide more detailed metrics why unaligned checkpoint is taking long time
> --------------------------------------------------------------------------
>
>                 Key: FLINK-18662
>                 URL: https://issues.apache.org/jira/browse/FLINK-18662
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Metrics, Runtime / Network, Runtime / Web Frontend
>    Affects Versions: 1.11.1
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>         Attachments: Screenshot 2020-07-21 at 11.50.02.png, checkpoint_monitoring-history-subtasks.png, checkpoint_monitoring-history.png
>
>
> With unaligned checkpoints, a situation like the one in the attached screenshot can occur.
> The task reports a long end-to-end checkpoint time (~2h50min), ~0s sync time, ~2h50min async time, and ~0s start delay. This means the task received the first checkpoint barrier from one of its channels very quickly (~0s) and the sync part was quick, but we do not know why the async part took so long. There are three possible causes:
> # long operator state IO writes
> # long spilling of in-flight data
> # a long wait for the final checkpoint barrier from the last lagging channel
> The first and second are probably indistinguishable, and the difference between them doesn't matter much for analysis. The last one, however, is quite different: it might be independent of IO, and we are currently missing this information.
> Maybe we could report it as "alignment duration" and, while we are at it, also report the amount of spilled in-flight data for unaligned checkpoints as "alignment buffered"?
> Ideally we should report these as new metrics, but that leaves the question of how to display them in the UI, with limited space available.
> Maybe it could be reported as:
> ||Alignment Buffered||Alignment Duration||
> |0 B (632 MB)|0ms (2h 49m 32s)|
> where the values in parentheses would come from unaligned checkpoints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
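
The breakdown proposed in the issue description can be illustrated with a small sketch. It is not Flink code: the class `SubtaskCheckpointStats`, its fields, and the helper functions are hypothetical names chosen for this example, and it assumes that "alignment duration" means the time between the first and the last barrier arriving at the subtask. It only shows how that number separates cause 3 (a lagging channel) from the IO-bound causes, and how the proposed table cells could be rendered.

```python
from dataclasses import dataclass


@dataclass
class SubtaskCheckpointStats:
    """Hypothetical per-subtask timings, in ms relative to the checkpoint trigger."""
    first_barrier_ms: int   # first barrier seen on any input channel
    last_barrier_ms: int    # barrier seen on the slowest input channel
    spilled_bytes: int      # in-flight data persisted by the unaligned checkpoint

    @property
    def start_delay_ms(self) -> int:
        return self.first_barrier_ms

    @property
    def alignment_duration_ms(self) -> int:
        # Assumption: time spent waiting for the last lagging channel.
        return self.last_barrier_ms - self.first_barrier_ms


def format_duration(ms: int) -> str:
    """Render milliseconds like the example cell, e.g. '0ms' or '2h 49m 32s'."""
    if ms < 1000:
        return f"{ms}ms"
    hours, rest = divmod(ms // 1000, 3600)
    minutes, seconds = divmod(rest, 60)
    parts = []
    if hours:
        parts.append(f"{hours}h")
    if hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{seconds}s")
    return " ".join(parts)


def format_bytes(n: int) -> str:
    """Render a byte count with 1024-based units, e.g. '0 B' or '632 MB'."""
    value = float(n)
    for unit in ("B", "KB", "MB"):
        if value < 1024:
            return f"{value:.0f} {unit}"
        value /= 1024
    return f"{value:.0f} GB"


def table_row(stats: SubtaskCheckpointStats) -> str:
    """One Jira-markup row: aligned value first, unaligned value in parentheses."""
    buffered = f"{format_bytes(0)} ({format_bytes(stats.spilled_bytes)})"
    duration = f"{format_duration(0)} ({format_duration(stats.alignment_duration_ms)})"
    return f"|{buffered}|{duration}|"


# Scenario from the description: first barrier after ~0s, last one ~2h50min later.
stats = SubtaskCheckpointStats(
    first_barrier_ms=0,
    last_barrier_ms=(2 * 3600 + 49 * 60 + 32) * 1000,
    spilled_bytes=632 * 1024 * 1024,
)
print(table_row(stats))
```

With these inputs the alignment duration accounts for essentially the whole async time, which would point at the lagging channel rather than at state writes or spilling.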