Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Piotr Nowojski Wed, 08 Nov 2023 01:10:36 -0800

Hi Zakelly,

Thanks for the comments. Quick answer for both of your questions would be
that it probably should be
left as a future work. For more detailed answers please take a look below :)


> Does it mean the inclusion and subdivision relationships of spans defined
> by "parent_id" are not supported? I think it is a very necessary feature
> for the trace.

Yes exactly, that is the current limitation. This could be solved somehow
one way or another in the future.

Support for reporting multi span traces all at once - for example
`CheckpointStatsTracker` running JM,
could upon checkpoint completion create in one place the whole structure of
parent spans, to have for
example one span per each subtask. This would be a relatively easy follow
up.

However, if we would like to create true distributed traces, with spans
reported from many different
components, potentially both on JM and TM, the problem is a bit deeper. The
issue in that case is how
to actually fill out `parrent_id` and `trace_id`? Passing some context
entity as a java object would be
unfeasible. That would require too many changes in too many places. I think
the only realistic way
to do it, would be to have a deterministic generator of `parten_id` and
`trace_id` values.

For example we could create the parent trace/span of the checkpoint on JM,
and set those ids to
something like: `jobId#attemptId#checkpointId`. Each subtask then could
re-generate those ids
and subtasks' checkpoint span would have an id of
`jobId#attemptId#checkpointId#subTaskId`.
Note that this is just an example, as most likely distributed spans for
checkpointing do not make
sense, as we can generate them much easier on the JM anyway.

> In addition to checkpoint and recovery, I believe the trace would also be
> valuable for performance tuning. If Flink can trace and visualize the time
> cost of each operator and stage for a sampled record, users would be able
> to easily determine the end-to-end latency and identify performance issues
> for optimization. Looking forward to seeing these in the future.

I'm not sure if I understand the proposal - I don't know how traces could
be used for this purpose?
Traces are perfect for one of events (like checkpointing, recovery, etc),
not for continuous monitoring
(like processing records). That's what metrics are. Creating trace and
span(s) per each record would
be prohibitively expensive.

Unless you mean in batch/bounded jobs? Then yes, we could create a bounded
job trace, with spans
for every stage/task/subtask.

Best,
Piotrek


śr., 8 lis 2023 o 05:30 Zakelly Lan <[email protected]> napisał(a):

> Hi Piotr,
>
> Happy to see the trace! Thanks for this proposal.
>
> One minor question: It is mentioned in the interface of Span:
>
> Currently we don't support traces with multiple spans. Each span is
> > self-contained and represents things like a checkpoint or recovery.
>
>
> Does it mean the inclusion and subdivision relationships of spans defined
> by "parent_id" are not supported? I think it is a very necessary feature
> for the trace.
>
> In addition to checkpoint and recovery, I believe the trace would also be
> valuable for performance tuning. If Flink can trace and visualize the time
> cost of each operator and stage for a sampled record, users would be able
> to easily determine the end-to-end latency and identify performance issues
> for optimization. Looking forward to seeing these in the future.
>
> Best,
> Zakelly
>
>
> On Tue, Nov 7, 2023 at 6:27 PM Piotr Nowojski <[email protected]>
> wrote:
>
> > Hi Rui,
> >
> > Thanks for the comments!
> >
> > > 1. I see the trace just supports Span? Does it support trace events?
> > > I'm not sure whether tracing events is reasonable for TraceReporter.
> > > If it supports, flink can report checkpoint and checkpoint path
> > proactively.
> > > Currently, checkpoint lists or the latest checkpoint can only be
> fetched
> > > by external components or platforms. And report is more timely and
> > > efficient than fetch.
> >
> > No, currently the `TraceReporter` that I'm introducing supports only
> single
> > span traces.
> > So currently neither events on their own, nor events inside spans are not
> > supported.
> > This is done just for the sake of simplicity, and test out the basic
> > functionality. But I think,
> > those currently missing features should be added at some point in
> > the future.
> >
> > About structured logging (basically events?) I vaguely remember some
> > discussions about
> > that. It might be a much larger topic, so I would prefer to leave it out
> of
> > the scope of this
> > FLIP.
> >
> > > 2. This FLIP just monitors the checkpoint and task recovery, right?
> >
> > Yes, it only adds single span traces for checkpointing and
> > recovery/initialisation - one
> > span per whole job per either recovery/initialization process or per each
> > checkpoint.
> >
> > > Could we add more operations in this FLIP? In our production, we
> > > added a lot of trace reporters for job starts and scheduler operation.
> > > They are useful if some jobs start slowly, because they will affect
> > > the job availability. For example:
> > > - From JobManager process is started to JobGraph is created
> > > - From JobGraph is created to JobMaster is created
> > > - From JobMaster is created to job is running
> > > - From start request tm from yarn or kubernetes to all tms are ready
> > > - etc
> >
> > I think those could be indeed useful. If you would like to contribute
> them
> > in the future,
> > I would be happy to review the FLIP for it :)
> >
> > > Of course, this FLIP doesn't include them is fine for me. The first
> > version
> > > only initializes the interface and common operations, and we can add
> > > more operations in the future
> >
> > Yes, that's exactly my thinking :)
> >
> > Best,
> > Piotrek
> >
> > wt., 7 lis 2023 o 10:05 Rui Fan <[email protected]> napisał(a):
> >
> > > Hi Piotr,
> > >
> > > Thanks for driving this proposal! The trace reporter is useful to
> > > check a lot of duration monitors inside of Flink.
> > >
> > > I have some questions about this proposal:
> > >
> > > 1. I see the trace just supports Span? Does it support trace events?
> > > I'm not sure whether tracing events is reasonable for TraceReporter.
> > > If it supports, flink can report checkpoint and checkpoint path
> > > proactively.
> > > Currently, checkpoint lists or the latest checkpoint can only be
> fetched
> > > by external components or platforms. And report is more timely and
> > > efficient than fetch.
> > >
> > > 2. This FLIP just monitors the checkpoint and task recovery, right?
> > > Could we add more operations in this FLIP? In our production, we
> > > added a lot of trace reporters for job starts and scheduler operation.
> > > They are useful if some jobs start slowly, because they will affect
> > > the job availability. For example:
> > > - From JobManager process is started to JobGraph is created
> > > - From JobGraph is created to JobMaster is created
> > > - From JobMaster is created to job is running
> > > - From start request tm from yarn or kubernetes to all tms are ready
> > > - etc
> > >
> > > Of course, this FLIP doesn't include them is fine for me. The first
> > version
> > > only initializes the interface and common operations, and we can add
> > > more operations in the future.
> > >
> > > Best,
> > > Rui
> > >
> > > On Tue, Nov 7, 2023 at 4:31 PM Piotr Nowojski <[email protected]>
> > > wrote:
> > >
> > > > Hi all!
> > > >
> > > > I would like to start a discussion on FLIP-384: Introduce
> TraceReporter
> > > and
> > > > use it to create checkpointing and recovery traces [1].
> > > >
> > > > This proposal intends to improve observability of Flink's
> Checkpointing
> > > and
> > > > Recovery/Initialization operations, by adding support for reporting
> > > traces
> > > > from Flink. In the future, reporting traces can be of course used for
> > > other
> > > > use cases and also by users.
> > > >
> > > > There are also two other follow up FLIPS, FLIP-385 [2] and FLIP-386
> > [3],
> > > > which expand the basic functionality introduced in FLIP-384 [1].
> > > >
> > > > Please let me know what you think!
> > > >
> > > > Best,
> > > > Piotr Nowojski
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > > > [2]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-385%3A+Add+OpenTelemetryTraceReporter+and+OpenTelemetryMetricReporter
> > > > [3]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Reply via email to