[ 
https://issues.apache.org/jira/browse/FLINK-17531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Flink Jira Bot updated FLINK-17531:
-----------------------------------
      Labels: auto-deprioritized-major auto-deprioritized-minor  (was: 
auto-deprioritized-major stale-minor)
    Priority: Not a Priority  (was: Minor)

This issue was labeled "stale-minor" 7 days ago and has not received any 
updates, so it is being deprioritized. If this ticket is actually Minor, please 
raise the priority and ask a committer to assign you the issue or revive the 
public discussion.


> Add a new checkpoint Gauge metric: elapsedTimeSinceLastCompletedCheckpoint
> --------------------------------------------------------------------------
>
>                 Key: FLINK-17531
>                 URL: https://issues.apache.org/jira/browse/FLINK-17531
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.10.0
>            Reporter: Steven Zhen Wu
>            Priority: Not a Priority
>              Labels: auto-deprioritized-major, auto-deprioritized-minor
>
> I'd like to discuss the value of a new checkpoint Gauge metric: 
> `elapsedTimeSinceLastCompletedCheckpoint`. The main motivation is alerting. I 
> know the reasons below are somewhat specific to our setup, so I want to 
> explore the community's interest.
>  
> *What do we want to achieve?*
> We want to alert if no checkpoint has completed successfully for a specific 
> period. With this new metric, we can set up a simple alerting rule like 
> `alert if elapsedTimeSinceLastCompletedCheckpoint > N minutes`. This follows 
> the good alerting pattern of `time since last success`. We found the 
> `elapsedTimeSinceLastCompletedCheckpoint` metric very easy and intuitive to 
> set up alerts against.
>  
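> A minimal sketch of how such a gauge could be implemented and registered, 
> purely for illustration (the `onCheckpointCompleted` hook and the timestamp 
> field are hypothetical, not actual Flink internals; only `Gauge` and 
> `MetricGroup#gauge` are existing Flink metric APIs):
> {code:java}
> import org.apache.flink.metrics.Gauge;
> import org.apache.flink.metrics.MetricGroup;
> 
> /** Reports the milliseconds elapsed since the last completed checkpoint. */
> public class ElapsedTimeSinceLastCompletedCheckpointGauge implements Gauge<Long> {
> 
>     private volatile long lastCompletedCheckpointTimestamp = -1L;
> 
>     /** Hypothetical hook, called whenever a checkpoint completes. */
>     public void onCheckpointCompleted(long completionTimestampMillis) {
>         lastCompletedCheckpointTimestamp = completionTimestampMillis;
>     }
> 
>     @Override
>     public Long getValue() {
>         // -1 until the first checkpoint completes, otherwise the elapsed time in ms.
>         return lastCompletedCheckpointTimestamp < 0
>                 ? -1L
>                 : System.currentTimeMillis() - lastCompletedCheckpointTimestamp;
>     }
> 
>     public void registerAt(MetricGroup jobMetricGroup) {
>         jobMetricGroup.gauge("elapsedTimeSinceLastCompletedCheckpoint", this);
>     }
> }
> {code}
> With a gauge like this, the alert rule above reduces to a single threshold 
> check on the reported value.
>  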
> *What about the existing checkpoint metrics?*
> `numberOfCompletedCheckpoints`. We can set up an alert like `alert if 
> derivative(numberOfCompletedCheckpoints) == 0 for N minutes`. However, this 
> is an anti-pattern for our alerting system, because it looks for the absence 
> of a good signal (rather than an explicit bad signal). Such an anti-pattern 
> is more prone to false alarms when there is an occasional metric drop or an 
> alerting-system processing issue.
>  
> `numberOfFailedCheckpoints`. That is an explicit failure signal, which is 
> good. We can set up an alert like `alert if 
> derivative(numberOfFailedCheckpoints) > 0 in X out of Y minutes`. However, we 
> have some high-parallelism, large-state jobs whose normal checkpoint duration 
> is under 1-2 minutes. When recovering from an outage with a large backlog, 
> subtasks on one or a few containers sometimes experience very high back 
> pressure, and the checkpoint barrier can take more than an hour to travel 
> through the DAG to those heavily back-pressured subtasks. The back pressure 
> is likely caused by the multi-tenant environment and performance variation 
> among containers. Instead of letting checkpoints time out in this case, we 
> decided to increase the checkpoint timeout to a very long value (like 2 
> hours). With that, we effectively lose the explicit "bad" signal of 
> failed/timed-out checkpoints.
>  
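> For reference, a minimal sketch of that kind of configuration (the interval 
> and timeout values are illustrative, not our exact settings):
> {code:java}
> import java.time.Duration;
> import org.apache.flink.streaming.api.CheckpointingMode;
> import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> 
> public class CheckpointTimeoutExample {
>     public static void main(String[] args) {
>         StreamExecutionEnvironment env =
>                 StreamExecutionEnvironment.getExecutionEnvironment();
> 
>         // Checkpoint every 2 minutes with at-least-once semantics (no barrier alignment).
>         env.enableCheckpointing(
>                 Duration.ofMinutes(2).toMillis(), CheckpointingMode.AT_LEAST_ONCE);
> 
>         // Very long timeout so heavily back-pressured checkpoints can complete
>         // instead of timing out, at the cost of the explicit failure signal.
>         env.getCheckpointConfig().setCheckpointTimeout(Duration.ofHours(2).toMillis());
>     }
> }
> {code}
>  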
> In theory, one could argue that we could set the checkpoint timeout to 
> infinity. A long but completed checkpoint is always better than a timed-out 
> one, because a timed-out checkpoint essentially gives up its position in the 
> queue and the new checkpoint resets that position back to the end of the 
> queue. Note that we use at-least-once checkpoint semantics, so there is no 
> barrier-alignment concern. FLIP-76 (unaligned checkpoints) can help 
> checkpoints deal with back pressure better, but it is not ready yet and also 
> has its limitations. That is a separate discussion.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
