[ 
https://issues.apache.org/jira/browse/FLINK-33856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17803478#comment-17803478
 ] 

Piotr Nowojski edited comment on FLINK-33856 at 1/5/24 9:58 AM:
----------------------------------------------------------------

{quote}
Maybe a new FLIP that supports a task-level trace reporter could be built? I'm
willing to participate in the development.
{quote}
Please check the FLIP-384 discussions again. I highlighted a couple of
difficulties there:
{quote}
However, if we would like to create true distributed traces, with spans
reported from many different components, potentially both on the JM and the
TMs, the problem is a bit deeper. The issue in that case is how to actually
fill out `parent_id` and `trace_id`. Passing some context entity around as a
Java object would be unfeasible; that would require too many changes in too
many places. I think the only realistic way to do it would be to have a
deterministic generator of `parent_id` and `trace_id` values.

For example, we could create the parent trace/span of the checkpoint on the
JM and set those ids to something like `jobId#attemptId#checkpointId`. Each
subtask could then re-generate those ids, and the subtask's checkpoint span
would have an id of `jobId#attemptId#checkpointId#subTaskId`. Note that this
is just an example, as distributed spans for checkpointing most likely do not
make sense, since we can generate them much more easily on the JM anyway.
{quote}
https://lists.apache.org/thread/7lql5f5q1np68fw1wc9trq3d9l2ox8f4
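
As a minimal sketch of that deterministic-generator idea (the class and method names below are hypothetical, not existing Flink API): every component derives the same ids purely from identifiers it already knows, so no tracing context has to be carried over RPC.
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper: derives trace/span ids deterministically from known identifiers. */
public final class DeterministicTraceIds {

    /** Trace id shared by the JM and all subtasks, derived from "jobId#attemptId#checkpointId". */
    public static String traceId(String jobId, int attemptId, long checkpointId) {
        return stableId(jobId + "#" + attemptId + "#" + checkpointId);
    }

    /** Span id of one subtask's checkpoint span; its parent is traceId(...). */
    public static String subtaskSpanId(
            String jobId, int attemptId, long checkpointId, int subTaskId) {
        return stableId(jobId + "#" + attemptId + "#" + checkpointId + "#" + subTaskId);
    }

    /** Any deterministic hash works; a name-based UUID is a convenient choice. */
    private static String stableId(String key) {
        return UUID.nameUUIDFromBytes(key.getBytes(StandardCharsets.UTF_8)).toString();
    }

    private DeterministicTraceIds() {}
}
{code}
Because the JM and every TM compute the ids from the same inputs, the resulting spans link up into one trace without any RPC having to carry a tracing context.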

At the same time:
{quote}
I am worried that aggregating a large amount of data on the JM may cause
performance problems.
{quote}
I wouldn't worry about that too much. This data is already aggregated on the JM
from all of the TMs via {{CheckpointMetricsBuilder}} and {{CheckpointMetrics}}.
Besides, it's just a single RPC from subtask -> JM per checkpoint. If that
became a problem, we would have problems in many other areas as well (for
example, {{notifyCheckpointCompleted}} is a very similar call in the opposite
direction).

Also, AFAIR there are/were different ideas on how to solve this potential
bottleneck in a more generic way (having multiple job coordinators in the
cluster to spread the load).

[~hejufang001] I would suggest that both of you chat offline about the scope of
the changes in [~fanrui]'s FLIP and/or the eventual division of work. I'm not
sure whether [~fanrui] plans to add per-task/subtask spans for checkpoints
and/or recovery.


> Add metrics to monitor the interaction performance between task and external 
> storage system in the process of checkpoint making
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33856
>                 URL: https://issues.apache.org/jira/browse/FLINK-33856
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.18.0
>            Reporter: Jufang He
>            Assignee: Jufang He
>            Priority: Major
>              Labels: pull-request-available
>
> When Flink makes a checkpoint, the performance of the interaction with the 
> external file system has a great impact on the overall checkpoint duration. 
> Therefore, it is easy to spot the bottleneck by adding performance metrics 
> for the task's interaction with the external file storage system. These 
> include: the file write rate, the latency of writing the file, and the 
> latency of closing the file.
> Adding these metrics on the Flink side has the following advantages: it 
> makes it convenient to compare the E2E checkpoint duration of different 
> tasks, and there is no need to distinguish between external storage system 
> types, since the metrics can be collected uniformly in the 
> FsCheckpointStreamFactory.
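
To make the proposal concrete, here is a minimal sketch of how such metrics could be captured, assuming a hypothetical wrapper around the checkpoint output stream (the class and metric names below are illustrative, not the actual implementation from the pull request):
{code:java}
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical wrapper that measures write latency, bytes written (the input
 * for a write-rate meter), and close latency of a checkpoint output stream.
 */
public class MeasuredCheckpointOutputStream extends OutputStream {

    private final OutputStream delegate;
    private long bytesWritten;
    private long writeNanos;

    public MeasuredCheckpointOutputStream(OutputStream delegate) {
        this.delegate = delegate;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        long start = System.nanoTime();
        delegate.write(b, off, len);
        writeNanos += System.nanoTime() - start; // accumulated write latency
        bytesWritten += len;                     // input for the write rate
    }

    @Override
    public void write(int b) throws IOException {
        write(new byte[] {(byte) b}, 0, 1);
    }

    @Override
    public void close() throws IOException {
        long start = System.nanoTime();
        delegate.close();
        long closeNanos = System.nanoTime() - start; // close latency
        // Report writeNanos, bytesWritten, and closeNanos to whatever metric
        // group is in scope, e.g. histograms/meters registered on the task.
    }
}
{code}
Wrapping the stream in one place is what makes the approach independent of the external storage system type: every backend goes through the same factory-created stream.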


