[ https://issues.apache.org/jira/browse/FLINK-33856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803459#comment-17803459 ]
Jufang He edited comment on FLINK-33856 at 1/5/24 8:39 AM:
-----------------------------------------------------------

[~pnowojski] Thanks for your advice. It seems that we need child spans for each subtask/task, so that we can collect more detailed task-level statistics and more easily locate the bottleneck in checkpoint making, such as syncDuration, async duration, the latency to write a file, and the latency to close a file. In that case 'writeRate' is no longer needed.

IMO, it would be better to report metrics separately for each TM. Our production environment has a large number of TMs and subtasks, and if the changelog checkpoint is enabled, checkpoints may be frequent. I am worried that aggregating a large amount of data on the JM may cause performance problems. Maybe a new FLIP that supports a task-level trace reporter could be created? I'm willing to participate in the development.
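
As a rough illustration of the per-stream timings discussed in the comment above (latency to write a file, latency to close a file), here is a minimal sketch in plain Java. `TimedOutputStream` and its getters are hypothetical names used only for illustration; they are not existing Flink classes, and the sketch does not show how the values would be reported as metrics or spans.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical wrapper (not a Flink class): accumulates the time spent in
 * write() and close() calls on the wrapped stream.
 */
final class TimedOutputStream extends FilterOutputStream {
    private long writeNanos;   // total time spent inside delegate write calls
    private long closeNanos;   // time spent inside the delegate close call
    private long bytesWritten; // total bytes passed to the delegate
    private boolean closed;

    TimedOutputStream(OutputStream delegate) {
        super(delegate);
    }

    @Override
    public void write(int b) throws IOException {
        long start = System.nanoTime();
        out.write(b);
        writeNanos += System.nanoTime() - start;
        bytesWritten++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        long start = System.nanoTime();
        out.write(buf, off, len);
        writeNanos += System.nanoTime() - start;
        bytesWritten += len;
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return; // only time the first close
        }
        closed = true;
        long start = System.nanoTime();
        try {
            out.close();
        } finally {
            closeNanos = System.nanoTime() - start;
        }
    }

    long getWriteNanos() { return writeNanos; }
    long getCloseNanos() { return closeNanos; }
    long getBytesWritten() { return bytesWritten; }
}
```

Because the wrapper only depends on `java.io.OutputStream`, the same timing logic would apply regardless of which external storage system backs the stream.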
> Add metrics to monitor the interaction performance between task and external storage system in the process of checkpoint making
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33856
>                 URL: https://issues.apache.org/jira/browse/FLINK-33856
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.0
>            Reporter: Jufang He
>            Assignee: Jufang He
>            Priority: Major
>              Labels: pull-request-available
>
> When Flink makes a checkpoint, the performance of its interaction with the
> external file system has a great impact on the overall duration. Adding
> performance metrics for the task's interaction with the external file storage
> system therefore makes the bottleneck easy to observe. These metrics include:
> the file write rate, the latency to write the file, and the latency to close
> the file.
> Adding the above metrics on the Flink side has the following advantages: it is
> convenient to collect the E2E duration of different tasks; and there is no need
> to distinguish the type of external storage system, since the metrics can be
> unified in the FsCheckpointStreamFactory.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)