[ https://issues.apache.org/jira/browse/FLINK-33695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Piotr Nowojski closed FLINK-33695.
----------------------------------
    Resolution: Fixed

> FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-33695
>                 URL: https://issues.apache.org/jira/browse/FLINK-33695
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / Checkpointing, Runtime / Metrics
>            Reporter: Piotr Nowojski
>            Assignee: Piotr Nowojski
>            Priority: Major
>             Fix For: 1.19.0
>
>
> https://cwiki.apache.org/confluence/x/TguZE
>
> *Motivation*
>
> Currently Flink has limited observability of the checkpoint and recovery
> processes.
>
> For checkpointing, Flink has a very detailed overview in the Flink WebUI,
> which works great in many use cases; however, it’s problematic if one is
> operating multiple Flink clusters, or if the cluster/JM dies. Additionally,
> there are a couple of metrics (like lastCheckpointDuration or
> lastCheckpointSize), but those metrics have a couple of issues:
> * They are reported and refreshed periodically, depending on the
> MetricReporter settings, which doesn’t take the checkpointing frequency
> into account.
> ** If checkpointing interval > metric reporting interval, we would be
> reporting the same values multiple times.
> ** If checkpointing interval < metric reporting interval, we would be
> randomly dropping metrics for some of the checkpoints.
>
> For recovery, we are missing even the most basic metrics and Flink WebUI
> support. Also, given that recovery is even less frequent than checkpointing,
> adding recovery metrics would have an even bigger problem with unnecessarily
> reporting the same values.
>
> In this FLIP I’m proposing to add support for reporting traces/spans
> (example: Traces) and to use this mechanism to report checkpointing and
> recovery traces. I hope that in the future traces will also prove useful in
> other areas of Flink, like job submission, job state changes, ... Moreover,
> as the API to report traces will be added to the MetricGroup, users will
> also be able to access this API.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)