[jira] [Closed] (FLINK-33695) FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Piotr Nowojski (Jira) Fri, 05 Jan 2024 08:26:05 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-33695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Piotr Nowojski closed FLINK-33695.
----------------------------------
    Resolution: Fixed

> FLIP-384: Introduce TraceReporter and use it to create checkpointing and 
> recovery traces
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-33695
>                 URL: https://issues.apache.org/jira/browse/FLINK-33695
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing, Runtime / Metrics
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Major
>             Fix For: 1.19.0
>
>
> https://cwiki.apache.org/confluence/x/TguZE
> *Motivation*
> Currently Flink has a limited observability of checkpoint and recovery 
> processes.
> For checkpointing Flink has a very detailed overview in the Flink WebUI, 
> which works great in many use cases, however it’s problematic if one is 
> operating multiple Flink clusters, or if cluster/JM dies. Additionally there 
> are a couple of metrics (like lastCheckpointDuration or lastCheckpointSize), 
> however those metrics have a couple of issues:
> * They are reported and refreshed periodically, depending on the 
> MetricReporter settings, which doesn’t take into account checkpointing 
> frequency.
> ** If checkpointing interval > metric reporting interval, we would be 
> reporting the same values multiple times.
> ** If checkpointing interval < metric reporting interval, we would be 
> randomly dropping metrics for some of the checkpoints.
> For recovery we are missing even the most basic of the metrics and Flink 
> WebUI support. Also given the fact that recovery is even less frequent 
> compared to checkpoints, adding recovery metrics would have even bigger 
> problems with unnecessary reporting the same values.
> In this FLIP I’m proposing to add support for reporting traces/spans 
> (example: Traces) and use this mechanism to report checkpointing and recovery 
> traces. I hope in the future traces will also prove useful in other areas of 
> Flink like job submission, job state changes, ... . Moreover as the API to 
> report traces will be added to the MetricGroup , users will be also able to 
> access this API. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Closed] (FLINK-33695) FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Reply via email to