Re: [DISCUSS] FLIP-384: Introduce TraceReporter and use it to create checkpointing and recovery traces

Zakelly Lan Tue, 07 Nov 2023 20:29:59 -0800

Hi Piotr,

Happy to see the trace! Thanks for this proposal.


One minor question: It is mentioned in the interface of Span:

> Currently we don't support traces with multiple spans. Each span is
> self-contained and represents things like a checkpoint or recovery.


Does this mean that parent/child (nesting) relationships between spans, as
defined by "parent_id", are not supported? I think that is a necessary
feature for tracing.
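
To make the question concrete, here is a rough sketch of the kind of
relationship I mean. Everything in it is hypothetical: the class names
(SketchSpan, SketchTrace), the fields, and the span names are made up for
illustration and are not the Span interface from the FLIP.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; NOT the Span interface proposed in FLIP-384.
final class SketchSpan {
    final String id;
    final String parentId; // null marks a root span
    final String name;
    final long startTsMillis;
    final long endTsMillis;

    SketchSpan(String id, String parentId, String name, long startTsMillis, long endTsMillis) {
        this.id = id;
        this.parentId = parentId;
        this.name = name;
        this.startTsMillis = startTsMillis;
        this.endTsMillis = endTsMillis;
    }

    long durationMillis() {
        return endTsMillis - startTsMillis;
    }
}

// A trace is modeled here as a flat list of spans linked through parentId.
final class SketchTrace {
    private final List<SketchSpan> spans = new ArrayList<>();

    void add(SketchSpan span) {
        spans.add(span);
    }

    // Returns the direct children of the span with the given id.
    List<SketchSpan> childrenOf(String spanId) {
        List<SketchSpan> result = new ArrayList<>();
        for (SketchSpan s : spans) {
            if (spanId.equals(s.parentId)) {
                result.add(s);
            }
        }
        return result;
    }
}
```

With something like this, a root "Checkpoint" span could be subdivided into
child spans for its phases, and a trace viewer could render them nested.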

In addition to checkpoint and recovery, I believe the trace would also be
valuable for performance tuning. If Flink can trace and visualize the time
cost of each operator and stage for a sampled record, users would be able
to easily determine the end-to-end latency and identify performance issues
for optimization. Looking forward to seeing these in the future.
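
As a rough sketch of what I mean, assuming the stages of a sampled record
run one after another without overlapping (so the per-stage times simply
add up). The class SampledRecordTimings and the stage names are
hypothetical and not part of Flink or of this FLIP.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not an existing Flink class.
final class SampledRecordTimings {
    // Keeps stages in the order they were first recorded.
    private final Map<String, Long> millisByStage = new LinkedHashMap<>();

    // Adds time spent in a stage; repeated calls for the same stage accumulate.
    void record(String stage, long millis) {
        millisByStage.merge(stage, millis, Long::sum);
    }

    // Sum of all stage times; equals end-to-end latency only under the
    // assumption that stages are sequential and do not overlap.
    long endToEndMillis() {
        long total = 0L;
        for (long m : millisByStage.values()) {
            total += m;
        }
        return total;
    }

    // Name of the stage with the largest accumulated time, or null if empty.
    String slowestStage() {
        String slowest = null;
        long max = Long.MIN_VALUE;
        for (Map.Entry<String, Long> e : millisByStage.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }
}
```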

Best,
Zakelly


On Tue, Nov 7, 2023 at 6:27 PM Piotr Nowojski <[email protected]> wrote:

> Hi Rui,
>
> Thanks for the comments!
>
> > 1. I see the trace just supports Span? Does it support trace events?
> > I'm not sure whether tracing events is reasonable for TraceReporter.
> > If it supports, flink can report checkpoint and checkpoint path
> > proactively.
> > Currently, checkpoint lists or the latest checkpoint can only be fetched
> > by external components or platforms. And report is more timely and
> > efficient than fetch.
>
> No, currently the `TraceReporter` that I'm introducing supports only
> single-span traces. So currently neither events on their own nor events
> inside spans are supported. This is done just for the sake of simplicity,
> and to test out the basic functionality. But I think those currently
> missing features should be added at some point in the future.
>
> About structured logging (basically events?) I vaguely remember some
> discussions about that. It might be a much larger topic, so I would prefer
> to leave it out of the scope of this FLIP.
>
> > 2. This FLIP just monitors the checkpoint and task recovery, right?
>
> Yes, it only adds single-span traces for checkpointing and
> recovery/initialization: one span for the whole job per
> recovery/initialization process, and one per checkpoint.
>
> > Could we add more operations in this FLIP? In our production, we
> > added a lot of trace reporters for job starts and scheduler operation.
> > They are useful if some jobs start slowly, because they will affect
> > the job availability. For example:
> > - From JobManager process is started to JobGraph is created
> > - From JobGraph is created to JobMaster is created
> > - From JobMaster is created to job is running
> > - From start request tm from yarn or kubernetes to all tms are ready
> > - etc
>
> I think those could indeed be useful. If you would like to contribute them
> in the future, I would be happy to review the FLIP for it :)
>
> > Of course, this FLIP doesn't include them is fine for me. The first
> > version only initializes the interface and common operations, and we
> > can add more operations in the future
>
> Yes, that's exactly my thinking :)
>
> Best,
> Piotrek
>
> On Tue, Nov 7, 2023 at 10:05 Rui Fan <[email protected]> wrote:
>
> > Hi Piotr,
> >
> > Thanks for driving this proposal! The trace reporter is useful to
> > check a lot of duration monitors inside of Flink.
> >
> > I have some questions about this proposal:
> >
> > 1. I see the trace just supports Span? Does it support trace events?
> > I'm not sure whether tracing events is reasonable for TraceReporter.
> > If it supports, flink can report checkpoint and checkpoint path
> > proactively.
> > Currently, checkpoint lists or the latest checkpoint can only be fetched
> > by external components or platforms. And report is more timely and
> > efficient than fetch.
> >
> > 2. This FLIP just monitors the checkpoint and task recovery, right?
> > Could we add more operations in this FLIP? In our production, we
> > added a lot of trace reporters for job starts and scheduler operation.
> > They are useful if some jobs start slowly, because they will affect
> > the job availability. For example:
> > - From JobManager process is started to JobGraph is created
> > - From JobGraph is created to JobMaster is created
> > - From JobMaster is created to job is running
> > - From start request tm from yarn or kubernetes to all tms are ready
> > - etc
> >
> > Of course, this FLIP doesn't include them is fine for me. The first
> > version only initializes the interface and common operations, and we
> > can add more operations in the future.
> >
> > Best,
> > Rui
> >
> > On Tue, Nov 7, 2023 at 4:31 PM Piotr Nowojski <[email protected]>
> > wrote:
> >
> > > Hi all!
> > >
> > > I would like to start a discussion on FLIP-384: Introduce
> > > TraceReporter and use it to create checkpointing and recovery
> > > traces [1].
> > >
> > > This proposal intends to improve observability of Flink's
> > > Checkpointing and Recovery/Initialization operations, by adding
> > > support for reporting traces from Flink. In the future, reporting
> > > traces can of course be used for other use cases and also by users.
> > >
> > > There are also two other follow-up FLIPs, FLIP-385 [2] and
> > > FLIP-386 [3], which expand the basic functionality introduced in
> > > FLIP-384 [1].
> > >
> > > Please let me know what you think!
> > >
> > > Best,
> > > Piotr Nowojski
> > >
> > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-384%3A+Introduce+TraceReporter+and+use+it+to+create+checkpointing+and+recovery+traces
> > > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-385%3A+Add+OpenTelemetryTraceReporter+and+OpenTelemetryMetricReporter
> > > [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-386%3A+Support+adding+custom+metrics+in+Recovery+Spans
> > >
> >
>
